Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

email._header_value_parser does not recognise in-line encoding changes #65514

Closed
valhallasw mannequin opened this issue Apr 20, 2014 · 12 comments
Closed

email._header_value_parser does not recognise in-line encoding changes #65514

valhallasw mannequin opened this issue Apr 20, 2014 · 12 comments
Labels
3.7 (EOL) end of life 3.8 (EOL) end of life topic-email type-bug An unexpected behavior, bug, or error

Comments

@valhallasw
Copy link
Mannequin

valhallasw mannequin commented Apr 20, 2014

BPO 21315
Nosy @warsaw, @bitdancer, @maxking, @miss-islington
PRs
  • bpo-21315: Fix parsing of encoded words with missing leading ws. #13425
  • [3.7] bpo-21315: Fix parsing of encoded words with missing leading ws. (GH-13425) #13846
  • [3.8] bpo-21315: Fix parsing of encoded words with missing leading ws (GH-13425) #15655
  • Files
  • 000359.raw: Example bugzilla e-mail
  • unstructured_ew_without_whitespace.diff: Unit test & possible fix
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-08-17.03:03:39.233>
    created_at = <Date 2014-04-20.15:58:18.522>
    labels = ['3.8', 'type-bug', '3.7', 'expert-email']
    title = 'email._header_value_parser does not recognise in-line encoding changes'
    updated_at = <Date 2019-09-03.17:08:44.481>
    user = 'https://bugs.python.org/valhallasw'

    bugs.python.org fields:

    activity = <Date 2019-09-03.17:08:44.481>
    actor = 'miss-islington'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-08-17.03:03:39.233>
    closer = 'maxking'
    components = ['email']
    creation = <Date 2014-04-20.15:58:18.522>
    creator = 'valhallasw'
    dependencies = []
    files = ['34984', '34985']
    hgrepos = []
    issue_num = 21315
    keywords = ['patch']
    message_count = 12.0
    messages = ['216908', '238956', '342739', '342745', '342750', '342755', '342756', '343070', '343271', '344750', '344841', '351091']
    nosy_count = 5.0
    nosy_names = ['barry', 'r.david.murray', 'valhallasw', 'maxking', 'miss-islington']
    pr_nums = ['13425', '13846', '15655']
    priority = 'normal'
    resolution = None
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue21315'
    versions = ['Python 3.7', 'Python 3.8']

    @valhallasw
    Copy link
    Mannequin Author

    valhallasw mannequin commented Apr 20, 2014

    Bugzilla sends e-mail in a format where =?UTF-8 is not preceded by whitespace. This makes email.headerregistry.UnstructuredHeader (and email._header_value_parser on the background) not recognise the structure.

    >>> import email.headerregistry, pprint
    >>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x)
    {'decoded': '[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\t'
                'russian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94',
     'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94')])}

    versus

    >>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew: =?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text: =?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x)
    {'decoded': '[Bug 64155]\tNew:  non-ascii bug tést;\trussian text:  АБВГҐД',
     'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('non-ascii'), WhiteSpaceTerminal(' '), ValueTerminal('bug'), WhiteSpaceTerminal(' '), ValueTerminal('tést')]), ValueTerminal(';'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('АБВГҐД')])])}

    I have attached the raw e-mail as attachment.

    Judging by the code, this is supposed to work (while raising a Defect -- "missing whitespace before encoded word"), but the code splits by whitespace:

    tok, *remainder = _wsp_splitter(value, 1)

    which swallows the encoded section in one go. In a second attachment, I added a patch which 1) adds a test case for this and 2) implements a solution, but the solution is unfortunately not in the style of the rest of the module.

    In the meanwhile, I've chosen a monkey-patching approach to work around the issue:

    import email._header_value_parser, email.headerregistry
    def get_unstructured(value):
        value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")
        return email._header_value_parser.get_unstructured(value)
    email.headerregistry.UnstructuredHeader.value_parser = staticmethod(get_unstructured)

    @valhallasw valhallasw mannequin added topic-email type-bug An unexpected behavior, bug, or error labels Apr 20, 2014
    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Mar 22, 2015

    Could someone formally review the patch please, it's only three additional lines of code and a new test.

    @maxking
    Copy link
    Contributor

    maxking commented May 17, 2019

    According to RFC 2047 5(1)

    However, an 'encoded-word' that appears in a
    header field defined as '*text' MUST be separated from any adjacent
    'encoded-word' or 'text' by 'linear-white-space'.

    So, it seems like splitting on whitespace is the right thing to do (see MUST).

    While your solution works for your case where the charset and cte are utf-8 and q respectively (not a general case for random chatsets and cte), it seems like a hack to get around the fact the header is non-conformant to RFC.

    IMO manipulating the original header (value.replace in your patch) isn't something we should do, but @r.david.murray would be the right person to answer how we handle non-conformant messages.

    @bitdancer
    Copy link
    Member

    A cleaner/safer solution here would be:

    tok, *remainder = _wsp_splitter(value, 1)
    if _rfc2047_matcher(tok):
    tok, *remainder = value.partition('=?')

    where _rfc2047_matcher would be a regex that matches a correctly formatted encoded word. There a regex for that in the header.py module, though for this application we don't need the groups it has.

    Abhilash, I'm not sure why you say the proposed solution only works for utf-8 and 'q'?

    @maxking
    Copy link
    Contributor

    maxking commented May 17, 2019

    The solution replaces RFC 20147 chrome for utf-8 and q to make sure there is a space before ew, it wouldn't replace in case of any other charset/cte pair.

        value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")

    Isn't that correct?

    @bitdancer
    Copy link
    Member

    I don't see that line of code in unstructured_ew_without_whitespace.diff.

    Oh, you are referring to his monkey patch. Yes, that is not a suitable solution for anyone but him, and I don't think he meant to imply otherwise :)

    @maxking
    Copy link
    Contributor

    maxking commented May 17, 2019

    Ah, I wrongly assumed the patch had the same thing.

    Sorry about that.

    @maxking
    Copy link
    Contributor

    maxking commented May 21, 2019

    Created a Pull Request for this.

    #13425

    @maxking
    Copy link
    Contributor

    maxking commented May 23, 2019

    I have made the requested changes on PR.

    @warsaw
    Copy link
    Member

    warsaw commented Jun 5, 2019

    New changeset 66c4f3f by Barry Warsaw (Abhilash Raj) in branch 'master':
    bpo-21315: Fix parsing of encoded words with missing leading ws. (bpo-13425)
    66c4f3f

    @warsaw
    Copy link
    Member

    warsaw commented Jun 6, 2019

    New changeset dc20fc4 by Barry Warsaw (Miss Islington (bot)) in branch '3.7':
    bpo-21315: Fix parsing of encoded words with missing leading ws. (GH-13425) (bpo-13846)
    dc20fc4

    @maxking maxking added 3.7 (EOL) end of life 3.8 (EOL) end of life labels Aug 17, 2019
    @maxking maxking closed this as completed Aug 17, 2019
    @miss-islington
    Copy link
    Contributor

    New changeset 59e8fba by Miss Islington (bot) (Ashwin Ramaswami) in branch '3.8':
    [3.8] bpo-21315: Fix parsing of encoded words with missing leading ws (GH-13425) (GH-15655)
    59e8fba

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 (EOL) end of life topic-email type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants