Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Email parser creates a message object that can't be flattened #76511

Closed
msapiro mannequin opened this issue Dec 15, 2017 · 7 comments
Closed

Email parser creates a message object that can't be flattened #76511

msapiro mannequin opened this issue Dec 15, 2017 · 7 comments
Labels
3.11 only security fixes 3.12 bugs and security fixes 3.13 new features, bugs and security fixes stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error

Comments

@msapiro
Copy link
Mannequin

msapiro mannequin commented Dec 15, 2017

BPO 32330
Nosy @warsaw, @msapiro, @bitdancer
PRs
  • bpo-32330 Replace invalid characters in email.message.set_payload(). #18059
  • Files
  • bad_email_2.eml: Sample message triggering issue
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2017-12-15.00:25:27.802>
    labels = ['3.8', 'type-bug', '3.7', 'expert-email', '3.9']
    title = "Email parser creates a message object that can't be flattened"
    updated_at = <Date 2020-01-19.06:38:24.901>
    user = 'https://github.com/msapiro'

    bugs.python.org fields:

    activity = <Date 2020-01-19.06:38:24.901>
    actor = 'msapiro'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['email']
    creation = <Date 2017-12-15.00:25:27.802>
    creator = 'msapiro'
    dependencies = []
    files = ['47333']
    hgrepos = []
    issue_num = 32330
    keywords = ['patch']
    message_count = 5.0
    messages = ['308353', '308361', '308362', '308395', '308421']
    nosy_count = 3.0
    nosy_names = ['barry', 'msapiro', 'r.david.murray']
    pr_nums = ['18059']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue32330'
    versions = ['Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9']

    Linked PRs

    @msapiro
    Copy link
    Mannequin Author

    msapiro mannequin commented Dec 15, 2017

    This is related to https://bugs.python.org/issue27321 but a different exception is thrown for a different reason. This is caused by a defective spam message. I don't actually have the offending message from the wild, but the attached bad_email_2.eml illustrates the problem.

    The defect is the message declares the content charset as us-ascii, but the body contains non-ascii. When the message is parsed into an email.message.Message object and the objects as_string() method is called, UnicodeEncodeError is thrown as follows:

    >>> import email
    >>> with open('bad_email_2.eml', 'rb') as fp:
    ...     msg = email.message_from_binary_file(fp)
    ... 
    >>> msg.as_string()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.5/email/message.py", line 159, in as_string
        g.flatten(self, unixfrom=unixfrom)
      File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
        self._write(msg)
      File "/usr/lib/python3.5/email/generator.py", line 181, in _write
        self._dispatch(msg)
      File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
        meth(msg)
      File "/usr/lib/python3.5/email/generator.py", line 243, in _handle_text
        msg.set_payload(payload, charset)
      File "/usr/lib/python3.5/email/message.py", line 316, in set_payload
        payload = payload.encode(charset.output_charset)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-33: ordinal not in range(128)

    @msapiro msapiro mannequin added topic-email type-bug An unexpected behavior, bug, or error labels Dec 15, 2017
    @bitdancer
    Copy link
    Member

    What would you like to see happen in that situation? Should we use errors=replace like we do for headers? (That seems reasonable to me.)

    Note that it can be re-serialized as binary.

    @msapiro
    Copy link
    Mannequin Author

    msapiro mannequin commented Dec 15, 2017

    Yes. I think errors=replace is a good solution. In Mailman, we have our own mailman.email.message.Message class which is a subclass of email.message.Message and what we do to work around this and bpo-27321 is override as_string() with:

        def as_string(self):
            # Work around for https://bugs.python.org/issue27321 and
            # https://bugs.python.org/issue32330.
            try:
                value = email.message.Message.as_string(self)
            except (KeyError, UnicodeEncodeError):
                value = email.message.Message.as_bytes(self).decode(
                    'ascii', 'replace')
            return value

    @bitdancer
    Copy link
    Member

    I do wonder where you are using the string version of messages :)

    I actually thought I'd already done this (errors=replace), but obviously not. I don't have time now to work on a patch for this, and the patch in the other issue hasn't be updated to reflect the review I did :(

    @msapiro
    Copy link
    Mannequin Author

    msapiro mannequin commented Dec 15, 2017

    I do wonder where you are using the string version of messages :)

    Probably some places where we could use bytes, but one of the problem areas is where we save the content of a message held for moderation.

    @msapiro msapiro mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes labels Jan 19, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 23, 2023
    @serhiy-storchaka
    Copy link
    Member

    message_from_binary_file() uses BytesParser.parsebytes() which simply converts bytes to str using the ascii encoding and the surrogateescape error handler and calls Parser.parsestr(). So at some point all non-ASCII bytes are encoded with surrogate escapes. The payload is set in set_payload() with charset=None, so no re-encoding is done. Then in get_payload() the payload is encoded to bytes (restoring original non-ASCII bytes) and decoded back with the known charset and replacing non-decodable bytes with Unicode replacement characters \ufffd. Then set_payload() tries to encode them back, but fails because non-encodable \ufffd.

    If use the replace error handler here, \ufffd will be converted to ASCII ?. And the part of information will be lost, because ? used as the replacement character is not distinguishable from the regular ? character.

    @serhiy-storchaka
    Copy link
    Member

    I think that #116125 is more correct solution.

    @serhiy-storchaka serhiy-storchaka added 3.11 only security fixes 3.12 bugs and security fixes 3.13 new features, bugs and security fixes and removed 3.9 only security fixes 3.8 only security fixes 3.7 (EOL) end of life labels Feb 29, 2024
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 5, 2024
    …ith ASCII charset (pythonGH-116125)
    
    (cherry picked from commit f97f25e)
    
    Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 5, 2024
    …ith ASCII charset (pythonGH-116125)
    
    (cherry picked from commit f97f25e)
    
    Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
    serhiy-storchaka added a commit that referenced this issue Mar 5, 2024
    …with ASCII charset (GH-116125) (GH-116364)
    
    (cherry picked from commit f97f25e)
    
    Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
    serhiy-storchaka added a commit that referenced this issue Mar 5, 2024
    …with ASCII charset (GH-116125) (GH-116365)
    
    (cherry picked from commit f97f25e)
    
    Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
    adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
    diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
    mdtro added a commit to getsentry/sentry that referenced this issue Apr 25, 2024
    - bump python to 3.11.8 to resolve security vulnerabilities
    - bump gosu to 1.17 to resolve security vulnerabilities
    - bump Django to 5.0.4
    - bump self-hosted docker image to _bookworm_ (Debian 12)
    
    We [previously tried bumping to Python
    3.11.9](#69468), but ran into an
    odd unicode decoding error in
    getsentry/getsentry#13760 within our tests. See
    python/cpython#76511. Python 3.11.8 works.
    
    ---------
    
    Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes 3.12 bugs and security fixes 3.13 new features, bugs and security fixes stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants