
gh-108580: Structure for importlib metadata idents #108585

Open · wants to merge 18 commits into base: main

Conversation


@orbisvicis orbisvicis commented Aug 28, 2023

cpython-cla-bot (bot) commented Aug 28, 2023

All commit authors signed the Contributor License Agreement.

@orbisvicis orbisvicis changed the title Fix issue 108580 gh-108580: Structure for importlib metadata idents Aug 28, 2023
@jaraco (Member) left a comment:

Overall, this looks good. I'd like to see more test coverage and have a few comments on the approach, but in general this looks acceptable, with the biggest concern being the possible backward incompatible change to the protocol.

This change will need to be backported to importlib_metadata if it's not contributed to there first. It would be preferable to contribute it there first as the test suite there is more robust, covering different Python implementations, performance checks, coverage reports, and exercising doctests (among other things). It's also easier to develop and faster to iterate and experiment there. And changes there get released to the public quickly for rapid feedback. It's by no means a requirement, but there for your consideration. Anything that's contributed there gets merged into CPython usually in short order.

Lib/importlib/metadata/_adapters.py (review comment, resolved)
Comment on lines 48 to 58
@property
def authors(self) -> set[Ident]:
"""
Minimal parsing for "Author" and "Author-email" fields.
"""

@property
def maintainers(self) -> set[Ident]:
"""
Minimal parsing for "Maintainer" and "Maintainer-email" fields.
"""
Member:

When these are introduced, they will invalidate the compliance with the Protocol for any implementations except the built-in one. I suspect other libraries are allowed to implement their own PackageMetadata. It may be okay to introduce this change, or it may be a backward-incompatibility that needs to be considered.

orbisvicis (Author):

If it's not OK, what are the alternatives? I can leave the methods on the implementation but remove them from the protocol. Or I can move the new methods to PackageMetadataV2(PackageMetadata), a child protocol.

Member:

I don't know the best approach. I'm not sure how rigorous checks for protocol compliance are. Apparently this protocol isn't decorated with runtime_checkable, suggesting that it mainly serves as documentation and not runtime behavior.
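The point about runtime_checkable can be made concrete. A minimal sketch (the class names here are hypothetical stand-ins, not the actual PackageMetadata protocol): without the decorator, isinstance() against a Protocol raises TypeError, so the protocol cannot enforce anything at runtime; with it, isinstance() verifies only that the named members exist.

```python
from typing import Protocol, runtime_checkable

class Meta(Protocol):
    """Plain protocol: serves type checkers and documentation only."""
    def __getitem__(self, name: str) -> str: ...

@runtime_checkable
class CheckableMeta(Protocol):
    """Same shape, but usable with isinstance()."""
    def __getitem__(self, name: str) -> str: ...

class ThirdPartyMetadata:
    """A structurally compliant third-party implementation."""
    def __getitem__(self, name: str) -> str:
        return "value"

impl = ThirdPartyMetadata()

try:
    # A plain Protocol is not runtime-checkable: this raises TypeError.
    isinstance(impl, Meta)
except TypeError:
    print("plain Protocol rejects isinstance()")

# A runtime_checkable Protocol only checks that the members exist.
print(isinstance(impl, CheckableMeta))
```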

Or I can move the new methods to PackageMetadataV2(PackageMetadata), a child protocol.

Eww. I don't like that at all. That can't possibly be the best way to evolve a protocol. I remember from C APIs that they were very strict about evolution: public APIs like the Win32 API would evolve a call once, from CallName to CallNameEx, so that after a year or two the standard was for everyone to use the Ex calls.

As I think about it more, the V2 concept isn't horrible. It does actually model the evolution of the protocol.

A part of me wishes there was an abstraction between the underlying versions of the protocol and the current recommended protocol, e.g. PackageMetadataLatest except without the Latest wart.

I guess what bothers me the most is that the name is now responsible for two concerns: what the protocol is attempting to model, and which iteration it is.

Here's what I recommend - let's get this change released as drafted to importlib_metadata and CPython 3.13 alpha and see if there are any adverse effects and learn/adapt. I may be overthinking the concern.
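For reference, the child-protocol idea debated above can be sketched as follows. This is only an illustration of the shape under discussion, not the PR's actual code; `Ident` here is a simplified, hypothetical stand-in for the PR's identity type.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass(frozen=True)  # frozen => hashable, so set[Ident] works
class Ident:
    name: Optional[str]
    email: Optional[str]

class PackageMetadata(Protocol):
    """The existing protocol, abridged to a single member."""
    def __getitem__(self, name: str) -> str: ...

class PackageMetadataV2(PackageMetadata, Protocol):
    """Child protocol: all V1 members plus the new derived properties."""
    @property
    def authors(self) -> "set[Ident]": ...
    @property
    def maintainers(self) -> "set[Ident]": ...
```

Implementations targeting the old protocol remain compliant with PackageMetadata, while code that needs the new properties can require PackageMetadataV2.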

Lib/importlib/metadata/_adapters.py (review comment, resolved)
Lib/importlib/metadata/_adapters.py (review comment, resolved)
orbisvicis (Author):

This change will need to be backported to importlib_metadata if it's not contributed to there first. It would be preferable to contribute it there first as the test suite there is more robust

I've started backporting the changes to importlib_metadata and the tests there have definitely identified some issues that I'll have to fix. It's mainly formatting (I'll have to process the changes through black), but the hypothesis test definitely runs slower than it should. In the meantime, let's proceed with the review of whatever I have available now.

Lib/test/test_importlib/identity.py (review comment, resolved)
@@ -0,0 +1,404 @@
import re
Member:

This one file increases the size of the test suite for importlib.metadata by ~50%. That's a lot of complexity for a very small aspect of the code. Since it's just the tests, I'm not too concerned, but it makes me wonder how much of this module should be factored out into another library.

At the very least, I'd like one or two docstrings explaining what this module does and an overview of its purpose/function.

Member:

On further consideration, I think it was a bad choice to add hypothesis tests late in the PR. It's unnecessary to the implementation and possibly overkill and should be considered separately. Please split the hypothesis tests into a separate review with a separate justification. Is it possible to get the "example" tests to work without all of this infrastructure?

orbisvicis (Author):

The hypothesis strategy is really clunky and complex, isn't it? But it did identify some edge cases that required me to rewrite both regexes, if you look at my commit history. I think that's justification enough. And believe it or not, the regexes are written to be simple and straightforward to understand. There's a lot of possible optimizations, such as loop unrolling, for which having this hypothesis strategy might eventually prove more useful.

Member:

Yeah, I guess that's a good justification. I do appreciate that the tests are noticeably enhanced by the adoption of hypothesis. Whether or not we split out the hypothesis concern, I'd definitely want to be sure it can be adopted upstream in importlib_metadata.

orbisvicis (Author):

I'll separate out the Hypothesis identities strategy as a separate PR for importlib_metadata.

Lib/test/test_importlib/test_metadata_api.py (review comment, resolved)
Lib/test/test_importlib/test_metadata_api.py (review comment, resolved)
Lib/importlib/metadata/_adapters.py (review comment, resolved)
@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

orbisvicis (Author):

I have made the requested changes; please review again.

@Zac-HD (Contributor) left a comment:

👋 just dropping in to offer some comments on the Hypothesis strategy here, since it's one of the more ambitious I've seen and has led to several issues, stack overflow questions, and a recent Hypothesis release. I hope the review will be useful whether or not the property-based tests are ultimately included in this PR!

Lib/test/test_importlib/identity.py (five review comments, resolved)

orbisvicis commented Sep 18, 2023

@Zac-HD Thanks for the feedback 👍. I've implemented everything but I'm still using function objects as categories just to reduce boilerplate. Unfortunately the tests now run more slowly, if I can trust that Hypothesis doesn't resume/shrink along the same slow path.

total=0; fail=0
for i in {0..100}; do
    ((total++))
    tox -- tests/test_api.py::MetadataAPITests
    status="$?"
    echo -e "\n"
    if ((status != 0)); then
        ((fail++))
    fi
done
printf "Fail:  %s\nTotal: %s\n" "$fail" "$total"

Failure rates:

  • At the beginning: 90+%
  • After removing custom probabilities: 43%
  • Now: 100% (edit: after deleting __pycache__, .hypothesis and .tox directories, failure rate is still 98%)

Summary of changes:

  • Work with strategies rather than data.
  • Draw directly from st.lists rather than drawing list length from st.integers.
  • Switch to draw(st.booleans()).
  • Move all functions to module scope (remove nesting).
  • Pass state via function parameters.

Commits:

* Add `PackageMetadata.authors` and `PackageMetadata.maintainers` to the
  `importlib.metadata` module. These unify and provide minimal parsing for
  the respective core metadata fields ("Author", "Author-email") and
  ("Maintainer", "Maintainer-email").
* Correct the parsing of name/email entries in the 'Author-email' and
  'Maintainer-email' fields of the core metadata specification.
* Add a hypothesis strategy for generating structured core metadata and
  equivalent unstructured text, ensuring that parsing the text using
  PackageMetadata results in the same structure - a roundtrip test.
  Includes a `note` stub which does nothing.
* Correct the generation of identity entries in the core metadata
  specification.
* Provide a name and docstring that better explain the test.
* Fix doctest string escapes.
* Ignore Ruff E501 (line-too-long) for a long multiline string used as a
  test argument.
* Manipulate strategies rather than data - functions now return
  strategies instead of data.
* Remove nesting by moving all functions to module scope.
* Pass state via function parameters rather than using nonlocal
  variables.
* Draw directly from 'st.lists' rather than drawing list length from
  'st.integers'.
* Switch to 'draw(st.booleans())'.
* Remove 'merge_char_opts', which was over-engineered.
* Update for the renamed Hypothesis arguments:
  'blacklist_categories' -> 'exclude_categories' and
  'blacklist_characters' -> 'exclude_characters'.
@Zac-HD (Contributor) left a comment:

I suspect that much of the remaining failure rate is due to the modification of already-drawn structures. If you take that out, it might be faster? I'm definitely just guessing here though and would be pretty close to grabbing a profiler myself.
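To make the suggestion concrete, here is a hedged sketch (not the PR's code) of folding a post-draw modification, such as the lstrip step, into the strategy itself with `.map()`, so that Hypothesis tracks one canonical value per draw instead of a drawn value that the test then mutates.

```python
from hypothesis import strategies as st

entries = st.lists(st.text(min_size=1), min_size=1)

# Instead of mutating after the draw:
#     xs = draw(entries)
#     xs = [x.lstrip() for x in xs]
# derive a new strategy, so shrinking and replay see the final form:
stripped_entries = entries.map(lambda xs: [x.lstrip() for x in xs])
```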

Comment on lines +98 to +99
# if simple_chars, smaller search space and easier visualization
exclude_categories=("C",) if simple_chars else ("Cs", "Co", "Cn"),
@Zac-HD (Contributor):

I'd only support the simple case, since having two distinct options makes the database less effective and I don't expect control-character-specific bugs to be common, or for that matter worth fixing. It's a lot of state-management code for very little return.

orbisvicis (Author):

I like having the option to test every possible combination, but I'll make simple_chars the default.

Lib/test/test_importlib/identity.py (four review comments, resolved)

orbisvicis commented Sep 19, 2023

I suspect that much of the remaining failure rate is due to the modification of already-drawn structures. If you take that out, it might be faster? I'm definitely just guessing here though and would be pretty close to grabbing a profiler myself.

I'm not sure about that since I forgot to draw from unbalance, which is the main source of modifications. The other source, lstrip, doesn't do very much. What's interesting is that the performance decreased after my last batch of changes; that's got to be because I'm working with strategies rather than data, right?

Anyway, the profile doesn't show anything interesting, though it does slow down the test enough that I can't get any successes. At this point I'm ready to chalk this up to "my slow computer". Besides, CPython runs Hypothesis in a profile that suppresses hypothesis.HealthCheck.too_slow, and for me the test finishes in 45 seconds maximum, which isn't too bad.

[nav] In [19]: # python3 -m cProfile -o profile.pytest.fail.log -m pytest test_api.py::MetadataAPITests
          ...: # python3 -m cProfile -o profile.unittest.fail.log -m unittest tests.test_api.MetadataAPITests.test_structured_identity -vv
          ...: import cProfile
          ...: import pstats
          ...: p = pstats.Stats("profile.pytest.fail.log")
          ...: #p = pstats.Stats("profile.unittest.fail.log")
          ...: #p.strip_dirs().sort_stats("cumulative").print_stats("identity")
          ...: p.strip_dirs().sort_stats("cumulative").print_callees("identity")
   Ordered by: cumulative time
   List reduced from 4272 to 44 due to restriction <'identity'>

Function                                   called...
                                               ncalls  tottime  cumtime
test_api.py:345(test_structured_identity)  ->      74    0.000    0.000  _adapters.py:223(__iter__)
                                                  222    0.000    0.000  _adapters.py:224(<genexpr>)
                                                   16    0.000    0.014  _adapters.py:311(authors)
                                                   16    0.000    0.001  _adapters.py:318(maintainers)
                                                   64    0.000    0.000  control.py:152(note)
                                                   16    0.000    0.054  core.py:762(test)
                                                    1    0.000    1.700  core.py:1234(wrapped_test)
                                                   16    0.000    0.037  test_api.py:334(metadata_from_text)
identity.py:203(identities_strategy)       ->      60    0.000    0.000  <string>:4(identity_entries)
                                                   60    0.000    1.543  data.py:921(draw)
                                                   75    0.000    0.000  identity.py:233(<genexpr>)
identity.py:251(identity_entries)          ->      60    0.000    0.000  <string>:4(unbalance)
                                                   60    0.000    0.005  core.py:238(lists)
                                                   60    0.000    1.496  data.py:921(draw)
                                                   60    0.001    0.003  identity.py:265(process)
                                                   60    0.000    0.000  identity.py:288(lstrip)
                                                   30    0.000    0.029  identity.py:353(name_entry)
                                                   30    0.000    0.003  identity.py:373(ident_entry)
                                                   89    0.000    0.000  {method 'append' of 'list' objects}
identity.py:404(ident_name_addr)           ->      16    0.000    0.000  <string>:4(ident_addr)
                                                   16    0.000    0.000  <string>:4(ident_name)
                                                   16    0.000    0.000  <string>:4(ident_name_addr_delim)
                                                   48    0.000    0.633  data.py:921(draw)
identity.py:428(ident_addr)                ->      26    0.000    0.000  <string>:4(ident_addr_domain)
                                                   26    0.000    0.000  <string>:4(ident_addr_local)
                                                   52    0.000    0.001  core.py:149(booleans)
                                                  104    0.001    0.599  data.py:921(draw)
                                                   73    0.000    0.000  {built-in method builtins.len}
                                                   26    0.000    0.000  {method 'endswith' of 'str' objects}
                                                   26    0.000    0.000  {method 'startswith' of 'str' objects}
identity.py:475(quote)                     ->     524    0.000    0.002  __init__.py:225(compile)
                                                  262    0.000    0.006  core.py:178(sampled_from)
                                                  524    0.002    0.373  data.py:921(draw)
                                                  262    0.001    0.016  identity.py:103(text_exclude)
                                                  524    0.003    0.004  {method 'sub' of 're.Pattern' objects}
identity.py:419(ident_name)                ->      28    0.000    0.000  <string>:4(merge_sub)
                                                   28    0.000    0.000  <string>:4(quote)
                                                   28    0.000    0.003  core.py:238(lists)
                                                   56    0.000    0.373  data.py:921(draw)
                                                   28    0.000    0.005  identity.py:487(ident_name_other)
                                                   28    0.000    0.000  strategies.py:391(__or__)
identity.py:361(name_value)                ->      30    0.000    0.000  <string>:4(merge_sub)
                                                   30    0.000    0.000  <string>:4(quote)
                                                   30    0.000    0.002  core.py:238(lists)
                                                   60    0.000    0.334  data.py:921(draw)
                                                   30    0.000    0.000  identity.py:39(from_empty)
                                                   30    0.000    0.004  identity.py:504(name_value_other)
                                                   30    0.000    0.000  strategies.py:391(__or__)
identity.py:458(ident_addr_local)          ->      26    0.000    0.000  <string>:4(merge_sub)
                                                   26    0.000    0.000  <string>:4(quote)
                                                   26    0.000    0.003  core.py:238(lists)
                                                   52    0.000    0.303  data.py:921(draw)
                                                   26    0.000    0.007  identity.py:492(ident_addr_local_other)
                                                   26    0.000    0.000  strategies.py:391(__or__)
identity.py:465(ident_addr_domain)         ->      26    0.000    0.000  <string>:4(merge_sub)
                                                   26    0.000    0.000  <string>:4(quote)
                                                   26    0.000    0.003  core.py:238(lists)
                                                   52    0.000    0.263  data.py:921(draw)
                                                   26    0.000    0.006  identity.py:499(ident_addr_domain_other)
                                                   26    0.000    0.000  strategies.py:391(__or__)
identity.py:393(ident_addr_only)           ->      10    0.000    0.000  <string>:4(ident_addr)
                                                    3    0.000    0.000  <string>:4(ident_name_addr_delim)
                                                   10    0.000    0.000  core.py:149(booleans)
                                                   23    0.000    0.232  data.py:921(draw)
                                                   10    0.000    0.000  identity.py:39(from_empty)
identity.py:386(ident_name_only)           ->      12    0.000    0.000  <string>:4(ident_name)
                                                   12    0.000    0.155  data.py:921(draw)
                                                   12    0.000    0.000  identity.py:39(from_empty)
identity.py:115(merge_sub)                 ->     198    0.000    0.000  <string>:2(__eq__)
                                                   73    0.000    0.000  <string>:4(sub_entry)
                                                  117    0.001    0.032  data.py:921(draw)
                                                  110    0.000    0.002  identity.py:123(keyfunc)
                                                  311    0.000    0.000  {built-in method builtins.len}
                                                  157    0.000    0.000  {method 'append' of 'list' objects}
                                                  308    0.000    0.002  {method 'extend' of 'list' objects}
                                                  157    0.000    0.002  {method 'join' of 'str' objects}
identity.py:412(ident_name_addr_delim)     ->      19    0.000    0.000  core.py:767(text)
                                                   19    0.000    0.029  data.py:921(draw)
                                                   19    0.000    0.000  {method 'endswith' of 'str' objects}
identity.py:353(name_entry)                ->      30    0.000    0.000  <string>:4(name_value)
                                                   30    0.000    0.028  identity.py:357(name_empty)
                                                   30    0.000    0.000  strategies.py:391(__or__)
identity.py:357(name_empty)                ->      30    0.000    0.028  core.py:1063(builds)
                                                   60    0.000    0.001  misc.py:57(just)
identity.py:1(<module>)                    ->      16    0.000    0.027  core.py:1799(composite)
                                                    1    0.000    0.001  dataclasses.py:1202(dataclass)
                                                    2    0.000    0.000  {built-in method builtins.__build_class__}
identity.py:103(text_exclude)              ->     372    0.001    0.009  core.py:767(text)
                                                  372    0.001    0.016  identity.py:91(characters_exclude)
identity.py:91(characters_exclude)         -> 
identity.py:146(sub_entry)                 ->      73    0.000    0.001  __init__.py:178(sub)
                                                   73    0.000    0.006  identity.py:91(characters_exclude)
                                                   73    0.000    0.000  identity.py:111(generate_draw_function)
identity.py:492(ident_addr_local_other)    ->      26    0.000    0.003  core.py:1063(builds)
                                                   26    0.000    0.003  identity.py:103(text_exclude)
                                                   26    0.000    0.000  misc.py:57(just)
identity.py:499(ident_addr_domain_other)   ->      26    0.000    0.003  core.py:1063(builds)
                                                   26    0.000    0.003  identity.py:103(text_exclude)
                                                   26    0.000    0.000  misc.py:57(just)
identity.py:487(ident_name_other)          ->      28    0.000    0.003  core.py:1063(builds)
                                                   28    0.000    0.002  identity.py:103(text_exclude)
                                                   28    0.000    0.000  misc.py:57(just)
identity.py:504(name_value_other)          ->      30    0.000    0.002  core.py:1063(builds)
                                                   30    0.000    0.002  identity.py:103(text_exclude)
                                                   30    0.000    0.000  misc.py:57(just)
identity.py:154(sub_name)                  ->      98    0.000    0.001  __init__.py:178(sub)
                                                   44    0.000    0.001  __init__.py:225(compile)
                                                   44    0.000    0.002  identity.py:91(characters_exclude)
                                                   44    0.000    0.000  identity.py:111(generate_draw_function)
                                                   44    0.000    0.000  {method 'sub' of 're.Pattern' objects}
identity.py:123(keyfunc)                   ->     506    0.002    0.003  dataclasses.py:1453(replace)
identity.py:373(ident_entry)               ->      30    0.000    0.003  identity.py:382(ident_empty)
                                                   90    0.000    0.000  strategies.py:391(__or__)
identity.py:382(ident_empty)               ->      30    0.000    0.002  core.py:1063(builds)
                                                   60    0.000    0.001  misc.py:57(just)
identity.py:265(process)                   ->     124    0.000    0.001  identity.py:56(as_text)
                                                  124    0.001    0.001  identity.py:64(as_data)
                                                   68    0.000    0.000  identity.py:281(<genexpr>)
                                                  124    0.000    0.000  {built-in method builtins.all}
                                                   68    0.000    0.000  {method 'add' of 'set' objects}
identity.py:129(<genexpr>)                 ->     184    0.000    0.000  <string>:2(__eq__)
                                                  184    0.000    0.001  identity.py:123(keyfunc)
identity.py:64(as_data)                    ->     248    0.000    0.000  identity.py:72(keyfunc)
                                                  248    0.000    0.000  {method 'append' of 'list' objects}
                                                  248    0.000    0.001  {method 'join' of 'str' objects}
identity.py:39(from_empty)                 ->     324    0.000    0.000  {method 'append' of 'list' objects}
identity.py:56(as_text)                    ->     124    0.000    0.000  {method 'join' of 'str' objects}
identity.py:79(<genexpr>)                  ->     595    0.000    0.000  identity.py:72(keyfunc)
<string>:4(identity_entries)               ->      60    0.000    0.000  core.py:1784(accept)
                                                   60    0.000    0.000  utils.py:123(accept)
identity.py:62(<genexpr>)                  -> 
identity.py:72(keyfunc)                    -> 
identity.py:288(lstrip)                    ->      83    0.000    0.000  {method 'lstrip' of 'str' objects}
identity.py:281(<genexpr>)                 -> 
identity.py:111(generate_draw_function)    -> 
identity.py:233(<genexpr>)                 -> 
argparse.py:1789(identity)                 -> 
identity.py:31(Entry)                      -> 
identity.py:18(Token)                      -> 


jaraco commented Sep 21, 2023

After further consideration, I realized that maybe this feature shouldn't be implemented in importlib.metadata at all. In prior decisions (e.g. python/importlib_metadata#429 (comment)), the team has decided to make importlib metadata a low-level reflector of the metadata in a fairly raw form and defer interpretation and more sophisticated objects to a third-party library (namely packaging). Another example of this strategy is in how version() returns a primitive string instead of a packaging.version.Version.

If we wish to remain consistent in our approach, we should probably consider contributing this behavior (interpretation of Author/Maintainer contacts) to packaging or some other third-party library. Imagine for example:

from importlib.metadata import metadata
from packaging.contacts import parse_authors

authors = parse_authors(metadata('some-package'))

I regret not having made this connection earlier, as I know the contributor has invested a lot in this change.

I'll be honest, I don't love this strategy because of how it requires the callers to draw in the dependency and separately wrap the calls to arrive at high-level objects.

It does, however, have the advantage of keeping the low-level implementation simple and decoupling the more sophisticated objects into a single, third-party library.

In light of these concerns and past decisions, can we make the case that idents/contacts should get special treatment in this regard?

orbisvicis (Author):

Why does packaging have its own metadata parser?

orbisvicis (Author):

I have roughly the same objections as @jaraco, but I also don't want to be stuck between a refactoring of importlib.metadata and packaging, as there seems to be some overlapping functionality. I've also spent more time than I should on this, so I'd appreciate if the packaging maintainers could provide some feedback here and now regarding backporting this to packaging - @pradyunsg and @brettcannon (?). I also feel that Python should be able to parse its own metadata without drawing in third-party dependencies.

Class packaging.metadata.RawMetadata is just another low-level reflector of the metadata with just some light touch-ups - _parse_keywords and _parse_project_urls. It is just a TypedDict providing a 1:1 correspondence with core metadata fields and so isn't suitable for the additional properties I am proposing.

Class packaging.metadata.Metadata provides a much higher-level interface through _Validator conversion functions, but it currently only offers a 1:1 correspondence with core metadata fields. Since I am proposing a combined interface to multiple fields, I am not sure how my proposal will be received by packaging developers. The Metadata class also appears to be a work in progress - not yet in the pypa documentation.

As I said, I'm not pleased with third-party dependencies, but the resulting workflow is only slightly more convoluted:

With this PR:

from importlib.metadata import packages_distributions, metadata

# The module name is more stable than the package name
pkg_metadata = metadata(packages_distributions()["some-importname"][0])
pkg_authors = pkg_metadata.authors

As properties on packaging.metadata.Metadata:

from importlib.metadata import packages_distributions, metadata
from packaging.metadata import Metadata

# The module name is more stable than the package name
pkg_metadata = Metadata.from_email(
    metadata(packages_distributions()["some-importname"][0])
)
authors = pkg_metadata.authors

Which would hook into a new module on packaging:

from importlib.metadata import packages_distributions, metadata
from packaging.identity import authors

# The module name is more stable than the package name
pkg_metadata = metadata(packages_distributions()["some-importname"][0])
pkg_authors = authors(pkg_metadata)

In light of these concerns and past decisions, can we make the case that idents/contacts should get special treatment in this regard?

That's a question for the team, but I'm noting that python/importlib_metadata#429, which became pypa/packaging#701, is still open.

@brettcannon
Member

I also feel that Python should be able to parse its own metadata without drawing in third-party dependencies.

To be clear, this is wheel metadata, not Python's own metadata (which is primarily reflected in the sys module). And technically Python doesn't care about wheel metadata in order to import something; if you left out the .dist-info directory then things keep working.

Why does packaging have its own metadata parser?

Because packaging has very different requirements in terms of Python version compatibility and release cadences that don't align with the stdlib.

Class packaging.metadata.RawMetadata is just another low-level reflector of the metadata with just some light touch-ups - _parse_keywords and _parse_project_urls. It is just a TypedDict providing a 1:1 correspondence with core metadata fields and so isn't suitable for the additional properties I am proposing.

Correct.

Class packaging.metadata.Metadata provides a much higher-level interface through _Validator conversion functions, but it currently only offers a 1:1 correspondence with core metadata fields.

Also correct.

Since I am proposing a combined interface to multiple fields, I am not sure how my proposal will be received by packaging developers.

We're happy with the current design, which went through multiple discussions over several months, as it focuses more on a simple API and validation than a heavily ergonomic, opinionated view of the metadata (i.e., it's meant to very much map directly to the underlying core metadata format). So I think providing some higher-level API would be best kept in a 3rd-party package.

The Metadata class also appears to be a work in progress - not yet in the pypa documentation.

It's done, it just hasn't been released yet (should be happening in the near future).

@orbisvicis
Author

orbisvicis commented Sep 22, 2023

@jaraco, do you want it as a new module in importlib.metadata? Otherwise I'll put it in one of my packages. But I think there's some benefit to minimizing the number of packages.

a heavily ergonomic, opinionated view of the metadata

The core metadata specification is sufficiently ambiguous that interpretation is required. Following my proposal at the Python packaging discourse, Structure for importlib metadata identities, I've tried to make as few assumptions as possible to maximize the space of inputs that can be correctly parsed.

To be clear, this is wheel metadata, not Python's own metadata

Aren't we striving for a world where the two are indistinguishable?

@jaraco
Member

jaraco commented Sep 22, 2023

So I think providing some higher-level API would be best kept in a 3rd-party package.

Ugh. What a mess we've gotten into. We've got two low-level metadata providers (importlib metadata and packaging) and one high-level provider (packaging), yet packaging isn't high-level enough to provide any ergonomic features, so we need yet another third-party package just for the ergonomics?

To be sure, packaging does provide some high-level ergonomics, like interpreting the meaning of a Version or Requirement or Specification, just not ergonomics that cross two fields.

Now that I see that packaging isn't a good home for this functionality, I'm more inclined to accept it within importlib metadata, so let's proceed with finding a way to accept it here. I don't want to be complicit in adding to the complication of the packaging ecosystem.

@jaraco, do you want it as a new module in importlib.metadata?

I'm unconcerned about the specific organization within importlib metadata. I do still want to be sure that this is contributed to importlib_metadata first (or at least simultaneously so that the porting work isn't left lingering).

To be clear, this is wheel metadata, not Python's own metadata

Aren't we striving for a world where the two are indistinguishable?

I don't think there's any plans to converge these two as they're unrelated domains of concern. Python's metadata doesn't have anything to do with packages (even in the stdlib).

To be sure, importlib_metadata and importlib.metadata are meant to be indistinguishable, and both deal with wheel metadata. I don't think Python metadata is relevant to this discussion.

@orbisvicis
Author

orbisvicis commented Sep 22, 2023

I do still want to be sure that this is contributed to importlib_metadata first (or at least simultaneously so that the porting work isn't left lingering).

There's an equivalent, flattened PR already waiting to run at importlib_metadata: python/importlib_metadata#471. I might have to replace usage of the parameterized package (in this PR) if we're not committing the Hypothesis tests to CPython, but aside from that it's largely ready.

I'm more inclined to accept it within importlib metadata, so let's proceed with finding a way to accept it here.

I've noticed that package metadata might not be a solved problem, what with the multiple interfaces, multiple packages, and multiple PRs, but I think I have an interesting proposal that preserves PackageMetadata as a low-level reflector of metadata. We could create a provisional child class StructuredPackageMetadataV1(PackageMetadata) that presents an ergonomic interface for higher-level parsing functionality, along with an accompanying structured_metadata() function. My changes would go there, and it's my hope that this class will become a grab-bag of various metadata parsing proposals with a lower bar of entry for pull requests. I think some of the best interfaces are those that grow organically within a limited scope, and this approach facilitates that. After a few years, importlib_metadata can cull the class, keeping only the popular methods in a finalized StructuredPackageMetadataV2 class.
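
A hypothetical sketch of the provisional class described above. The names (Ident, StructuredPackageMetadataV1, structured_metadata) follow the proposal, but the parsing details here are illustrative only, not the PR's actual code:

```python
# Hypothetical sketch of the proposed provisional class. Parsing details
# are illustrative, not the PR's implementation.
from email.message import Message
from email.utils import getaddresses
from typing import NamedTuple, Optional


class Ident(NamedTuple):
    name: Optional[str]
    email: Optional[str]


class StructuredPackageMetadataV1:
    """Ergonomic, higher-level view over low-level package metadata."""

    def __init__(self, metadata: Message):
        self._metadata = metadata

    @property
    def authors(self) -> "list[Ident]":
        # Parse the RFC 5322-style address list in "Author-email",
        # preserving order while dropping exact duplicates.
        fields = self._metadata.get_all("Author-email") or []
        idents = []
        for name, addr in getaddresses(fields):
            ident = Ident(name or None, addr or None)
            if ident not in idents:
                idents.append(ident)
        return idents


def structured_metadata(metadata: Message) -> StructuredPackageMetadataV1:
    return StructuredPackageMetadataV1(metadata)
```

Under this sketch, `structured_metadata(metadata("some-package")).authors` would return an ordered list of Ident tuples.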

Then there's the opposite approach, which isolates my PR as much as possible in a separate module. I'm not a fan of this approach, but however it happens, I do need the functionality for my own tooling.

from importlib.metadata import packages_distributions, metadata
from importlib.metadata.structure import authors

pkg_metadata = metadata(packages_distributions()["some-importname"][0])
pkg_authors = authors(pkg_metadata)

@brettcannon
Member

The core metadata specification is sufficiently ambiguous that interpretation is required.

Then let's work towards fixing the ambiguity in the specs instead of trying to come up with our own interpretation via code. That's what leads to conventions being assumed to be standardized.

@orbisvicis
Author

I'm not sure that's a good idea. As I see it, the core metadata has several shortcomings:

  • a poorly specified, custom format
  • a design inspired by the overly-complicated email format
  • the email format's issues with internationalization and encoding
  • architectural issues such as similar keys with overlapping usages

If it were up to me I'd replace the custom format with TOML or JSON or some more structured format and deduplicate keys such as "Author", "Author-email" and "Maintainer", "Maintainer-email". As I understand it, there have been various PRs to enhance the ergonomics of the metadata API, so a large part of this project would be to collect feedback from the community.

In 2021 there was a discussion [1] about the encoding of non-ASCII names and emails, in which the core metadata identity fields "{name} <{email}>" were described as "intended to only be a guide, not that it should be the exact thing to use when formatting the field". That led to a PR against the specification [2], active to this day, describing the core metadata as "exactly which email RFC applies to packaging metadata is not specified" and noting that "the need for tools to work with years worth of existing packages makes it difficult to shift to a new format" (referring to PEP 566 from 2017, which describes a JSON-compatible version of the core metadata).

This prompted a push [3] to better specify the existing core metadata format, which in turn pushed packaging to implement a metadata API [4] and eventually led to PEP 685, which "specifies how to normalize distribution extra names when performing comparisons" - a relatively narrow focus. As far as I can tell, a formal specification of the entire format in EBNF notation was never finalized.

Meanwhile, the current toolchain (2023) processes multi-line identity fields in pyproject.toml to produce METADATA files that importlib.metadata cannot parse. This was caught by my Hypothesis test, and I'm not sure whether it's a failure of the metadata specification or of the toolchain, given the complexity of the email format (I've simply blacklisted newlines from the test). I wasn't aware of packaging at the time, but I'd be curious to see how its metadata parser performs.

And let's not forget PEP 426 [5], from all the way back in 2012, which aimed to replace the core metadata with a JSON format but was abandoned due to "no feasible migration plan". The original metadata specification (PEP 241) was created in 2001, and I'm sure I've only scratched the surface when researching the history of proposals intended to fix its shortcomings. From my perspective not much has changed, and throwing my hat in the ring would exceed the scope of what I am able to provide - especially since I would propose an interface that merges or removes certain keys, which is more radical by far.

That said, I'd suggest the core metadata specification is replaced sooner rather than later, before the technical debt of all published Python packages becomes insurmountable.

[1] https://discuss.python.org/t/core-metadata-email-fields-unicode/7421
[2] https://github.com/pypa/packaging.python.org/pull/851/files
[3] https://discuss.python.org/t/python-metadata-format-specification-and-implementation/7550
[4] pypa/packaging#383
[5] https://peps.python.org/pep-0426/
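
The multi-line identity fields mentioned above can be reproduced with the stdlib email parser that importlib.metadata builds on. This sketch (header value borrowed from the core metadata spec's Author example) shows how the legacy and modern parsing policies differ:

```python
# Sketch: how the stdlib email parser, which importlib.metadata builds on,
# handles a folded (multi-line) Author header.
import email
import email.policy

raw = (
    "Author: C. Schultz, Universal Features Syndicate,\n"
    "        Los Angeles, CA <cschultz@peanuts.example.com>\n"
    "\n"
)

# The legacy compat32 policy (the default) keeps the raw folded value,
# embedded newline included - which downstream consumers rarely expect.
legacy = email.message_from_string(raw)
print(repr(legacy["Author"]))

# The modern default policy unfolds continuation lines into one logical value.
modern = email.message_from_string(raw, policy=email.policy.default)
print(repr(str(modern["Author"])))
```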

@brettcannon
Member

FYI I'm unsubscribing as I'm not really seeing a need for any further input from me.

* Document `PackageMetadata.authors` and `PackageMetadata.maintainers`
  in the usage guide.
* Export `Ident` so it can be documented but also used to build custom
  parsing strategies.
@dstufft
Member

dstufft commented Sep 27, 2023

I mentioned it on discuss.p.o, but for whatever it's worth, I think we should not be adding APIs to these core utilities like importlib.metadata or packaging that make assumptions about the content of the metadata beyond what the spec itself says.

If the spec isn't good enough, then we should improve the spec, not paper over it in code.

Properties `PackageMetadata.authors` and `PackageMetadata.maintainers`
are now lists with repeated elements removed, rather than sets.
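
The order-preserving de-duplication this commit note describes can be sketched with dict.fromkeys, which relies on dicts preserving insertion order (Python 3.7+); the actual PR implementation may differ:

```python
# Sketch: order-preserving de-duplication via dict.fromkeys. Dicts keep
# insertion order, so the first occurrence of each element survives.
def unique(items):
    return list(dict.fromkeys(items))

print(unique(["a", "b", "a", "c"]))
# ['a', 'b', 'c']
```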
@jaraco
Member

jaraco commented Dec 14, 2023

Since there seems to be widespread consensus that importlib.metadata should reflect the spec, I took a look at the spec for Author and Author-email, which gives these examples:

Author: C. Schultz, Universal Features Syndicate,
        Los Angeles, CA <cschultz@peanuts.example.com>
Author-email: "C. Schultz" <cschultz@example.com>
Author-email: cschultz@example.com, snoopy@peanuts.com

Assuming the latter two are combined as:

Author-email: "C. Schultz" <cschultz@example.com>, cschultz@example.com, snoopy@peanuts.com

How would .authors be interpreted for those values? Are these cases (or something similar) included in the test suite? If so, I don't see where.

Based on the feedback above, if this proposed change provides a meaningful interpretation of what the spec prescribes, it's worth implementing, but it should avoid papering over ambiguity in the spec or otherwise creating a de facto spec by creating implicit constraints beyond what the spec provides.
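
For reference, this is how the stdlib RFC 5322 address parser splits the combined Author-email value quoted above; it's only the stdlib baseline, not necessarily how the PR's property interprets it:

```python
# Sketch: the stdlib address parser's reading of the combined field.
# Bare addresses come back with an empty display name.
from email.utils import getaddresses

combined = (
    '"C. Schultz" <cschultz@example.com>, '
    "cschultz@example.com, snoopy@peanuts.com"
)
pairs = getaddresses([combined])
print(pairs)
# [('C. Schultz', 'cschultz@example.com'), ('', 'cschultz@example.com'),
#  ('', 'snoopy@peanuts.com')]
```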

@jaraco
Member

jaraco commented Mar 7, 2024

If it were up to me I'd replace the custom format with TOML or JSON or some more structured format

What about trying not to bite off so much and instead specify two fields, "Authors" and "Maintainers" with a specified structure that supports the desired use-cases. Or consider a "Contributor" field that contains any number of entities with their associated role (author, maintainer, etc). Is there anything about the email header format/syntax that would prohibit supplying all possible values?

Still, I understand if you're reluctant to go down that path and I'm comfortable allowing importlib.metadata to facilitate access to the aspects already specified.

How would .authors be interpreted for those values? Are these cases (or something similar) included in the test suite? If so, I don't see where.

Based on the feedback above, if this proposed change provides a meaningful interpretation of what the spec prescribes, it's worth implementing, but it should avoid papering over ambiguity in the spec or otherwise creating a de facto spec by creating implicit constraints beyond what the spec provides.

Have you thought any more about these questions and concerns? I feel like they're the primary blockers against accepting this change.

@jaraco jaraco self-assigned this Apr 19, 2024
@jaraco
Member

jaraco commented Apr 19, 2024

In python/importlib_metadata#471 (comment), I'm diving into the details of this proposal and have identified some unexpected behaviors.
