gh-67022: Document bytes/str inconsistency in email.header.decode_header() and add .decode_header_to_string() as a sane alternative #92900

Open: dlenski wants to merge 3 commits into base: main

@dlenski dlenski commented May 17, 2022

This function's possible return types have been surprising and error-prone
for the entirety of its Python 3.x history. It can return either:

- typing.List[typing.Tuple[str, None]], of length exactly 1
- or typing.List[typing.Tuple[bytes, typing.Optional[str]]]

This function can't be rewritten to be more consistent in a backwards-compatible way, because some users of this function depend on the existing return type(s).
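
A minimal sketch of the two return shapes, using only the existing `email.header.decode_header`; the sample header values are made up for illustration:

```python
from email.header import decode_header

# No RFC 2047 encoded words: the input comes back unchanged, as str.
plain = decode_header("hello")

# At least one encoded word: every chunk comes back as bytes.
encoded = decode_header("=?iso-8859-1?q?hello_f=F3o_bar?=")

print(plain)    # [('hello', None)]
print(encoded)  # [(b'hello f\xf3o bar', 'iso-8859-1')]
```

The first element of each tuple is `str` in one case and `bytes` in the other, so callers cannot rely on a single type.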

This PR addresses the inconsistency as suggested by @JelleZijlstra in #67022 (comment):

  1. we should document the surprising return type at https://docs.python.org/3.10/library/email.header.html.

  2. [create] a new function with a sane return type.

The "sane", Pythonic way to decode an email/MIME message header value is simply to convert the whole header to a str; the details of exactly which parts of the header were encoded in which charsets are not relevant to most users. Fortunately, the email.header module already contains a mechanism to do this, via the __str__ method of email.header.Header, so we can simply create a wrapper function to guide users in the right direction.
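
A rough sketch of what such a wrapper could look like with today's standard library, assuming it is built on `make_header` and `Header.__str__`. The name `header_to_str` is hypothetical and this is not necessarily how the PR implements `decode_header_to_string`:

```python
from email.header import decode_header, make_header


def header_to_str(value: str) -> str:
    """Hypothetical helper: decode an RFC 2047 header value to a plain str."""
    # make_header() accepts the (chunk, charset) pairs from decode_header(),
    # whether the chunks are bytes or str; str() then joins them into one string.
    return str(make_header(decode_header(value)))


print(header_to_str("hello =?utf-8?B?ZsOzbw==?= bar"))
print(header_to_str("=?iso-8859-1?q?hello_f=F3o_bar?="))
```

Both calls should print the same decoded text regardless of which charset each part was encoded in.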

Example of the old/inconsistent (decode_header) vs. new/sane (decode_header_to_string) functions:

>>> from email.header import decode_header, decode_header_to_string
>>>
>>> # Do most users care about this distinction in (sub)encodings? I think not.
>>> print(decode_header('hello =?utf-8?B?ZsOzbw==?= bar'))
[(b'hello ', None), (b'f\xc3\xb3o', 'utf-8'), (b' bar', None)]
>>> print(decode_header('=?iso-8859-1?q?hello_f=F3o_bar?='))
[(b'hello f\xf3o bar', 'iso-8859-1')]
>>>
>>> # Assuming not, this is a much saner interface
>>> print(decode_header_to_string('hello =?utf-8?B?ZsOzbw==?= bar'))
hello fóo bar
>>> print(decode_header_to_string('=?iso-8859-1?q?hello_f=F3o_bar?='))
hello fóo bar

(Closes #30548 and replaces it.)

@dlenski dlenski requested a review from a team as a code owner May 17, 2022 20:52

cpython-cla-bot bot commented May 17, 2022

All commit authors signed the Contributor License Agreement.

Member

@warsaw warsaw left a comment

In general, I think this would help users of the legacy API, although I think we should also steer people to the new API. What does @bitdancer think?

Doc/library/email.header.rst (resolved review thread)
.. note::

   This function exists for backwards compatibility only. For
   new code we recommend using :func:`email.header.decode_header_to_string`.
Member

How about adding a link to the non-legacy API, or an example using that newer API?

Author

Do you mean the non-legacy API as described in https://docs.python.org/3/library/email.html, e.g. email.parser?

To my knowledge, there is not any function/method in that API which can be straightforwardly used instead of email.header.decode_header.

Member

@bitdancer bitdancer Feb 21, 2023

>>> from email.headerregistry import HeaderRegistry
>>> decoder = HeaderRegistry()
>>> decoder('To', '=?utf-8?q?M=C3=A4x?= <foo@bar.com>')
'Mäx <foo@bar.com>'
>>> decoder('To', '=?utf-8?q?M=C3=A4x?= <foo@bar.com>').addresses
(Address(display_name='Mäx', username='foo', domain='bar.com'),)

You really don't want to use the legacy decode_header. It has many bugs that the new API fixes.
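
For comparison, a small sketch of the parser route in the non-legacy API, assuming the header arrives as part of a full message parsed with `policy.default`. The sample message text is made up for illustration:

```python
from email import message_from_string, policy

# A tiny made-up message with an RFC 2047 encoded Subject header.
raw = "Subject: =?utf-8?q?M=C3=A4x?=\n\nbody\n"

# With policy.default the parser returns an EmailMessage whose header
# values are str subclasses that are already decoded.
msg = message_from_string(raw, policy=policy.default)
subject = msg["Subject"]

print(subject)
```

This only helps when a whole message is being parsed; for a bare header value, the `HeaderRegistry` call shown above is the closer equivalent.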

Member

Obviously this needs to be better documented...

Lib/email/header.py (two resolved review threads)
@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be put in the comfy chair!

…de_header()

This function's possible return types have been surprising and error-prone
for the entirety of its Python 3.x history. It can return either:

1. `typing.List[typing.Tuple[bytes, typing.Optional[str]]]` of length >=1
2. or `typing.List[typing.Tuple[str, None]]`, of length exactly 1

This means that any user of this function must be prepared to accept either
`bytes` or `str` for the first member of the 2-tuples it returns, which is a
very surprising behavior in Python 3.x, particularly given that the second
member of the tuple is supposed to represent the charset/encoding of the
first member.
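
A sketch of the defensive handling this forces on callers, assuming every charset reported by `decode_header` is a codec Python knows; `join_decoded` is a hypothetical helper name, not part of the patch:

```python
from email.header import decode_header


def join_decoded(value):
    """Hypothetical helper: join decode_header() chunks into one str."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded-word case: chunks are bytes; charset may be None
            # for the unencoded stretches between encoded words.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # No encoded words at all: the single chunk is already str.
            parts.append(chunk)
    return "".join(parts)


print(join_decoded("hello =?utf-8?B?ZsOzbw==?= bar"))
print(join_decoded("hello"))
```

Omitting the `isinstance` check works for one kind of input and fails for the other, which is why the behavior is easy to get wrong.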

This patch documents the behavior of this function, and adds test cases
to demonstrate it.

As discussed in bpo-22833, this cannot be changed in a backwards-compatible
way, and some users of this function depend precisely on the existing
behavior.

`decode_header_to_string` takes an email header, possibly with portions
encoded according to RFC 2047, and converts it to a standard Python string.

It is intended to provide a sane, Pythonic replacement for
`email.header.decode_header()`, which has two major problems:

1. May return either bytes or str (bpo-22833/gh-67022), an
   inconsistent and error-prone interface
2. Exposes details of an email header value's encoding which
   most users will not care about or want to deal with. Many users
   likely just want to decode an email header value to a Python
   string.

It turns out that `email.header` already contained most of the code
necessary to do this, and providing `decode_header_to_string` as a
documented wrapper function points users in the right direction.
@dlenski
Author

dlenski commented Jul 20, 2022

I have made the requested changes; please review again, @warsaw.

And if you don't make the requested changes, you will be put in the comfy chair!

😂

@bedevere-bot

Thanks for making the requested changes!

@warsaw: please review the changes made to this pull request.

@bedevere-bot bedevere-bot requested a review from warsaw July 20, 2022 21:24
@dlenski
Author

dlenski commented Feb 21, 2023

I have made the requested changes; please review again

@bedevere-bot

Thanks for making the requested changes!

@warsaw: please review the changes made to this pull request.
