
gh-108580: Structure for importlib metadata idents #108585

Open · wants to merge 18 commits into base: main

Conversation


@orbisvicis orbisvicis commented Aug 28, 2023

cpython-cla-bot (bot) commented Aug 28, 2023

All commit authors signed the Contributor License Agreement.

@orbisvicis orbisvicis changed the title Fix issue 108580 gh-108580: Structure for importlib metadata idents Aug 28, 2023
@jaraco (Member) left a comment:

Overall, this looks good. I'd like to see more test coverage and have a few comments on the approach, but in general this looks acceptable, with the biggest concern being the possible backward incompatible change to the protocol.

This change will need to be backported to importlib_metadata if it's not contributed to there first. It would be preferable to contribute it there first as the test suite there is more robust, covering different Python implementations, performance checks, coverage reports, and exercising doctests (among other things). It's also easier to develop and faster to iterate and experiment there. And changes there get released to the public quickly for rapid feedback. It's by no means a requirement, but there for your consideration. Anything that's contributed there gets merged into CPython usually in short order.

Lib/importlib/metadata/_adapters.py (review comment, resolved)
Comment on lines 48 to 58
@property
def authors(self) -> set[Ident]:
"""
Minimal parsing for "Author" and "Author-email" fields.
"""

@property
def maintainers(self) -> set[Ident]:
"""
Minimal parsing for "Maintainer" and "Maintainer-email" fields.
"""
Member:

When these are introduced, they will invalidate the compliance with the Protocol for any implementations except the built-in one. I suspect other libraries are allowed to implement their own PackageMetadata. It may be okay to introduce this change, or it may be a backward-incompatibility that needs to be considered.

orbisvicis (Author):

If it's not OK, what are the alternatives? I can leave the methods on the implementation but remove them from the protocol. Or I can move the new methods to PackageMetadataV2(PackageMetadata), a child protocol.

Member:

I don't know the best approach. I'm not sure how rigorous checks for protocol compliance are. Apparently this protocol isn't decorated with runtime_checkable, suggesting that it mainly serves as documentation and not runtime behavior.
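The point about runtime_checkable can be made concrete. A minimal sketch (the class names here are hypothetical stand-ins, not the actual PackageMetadata protocol): without the decorator, isinstance() against a Protocol raises TypeError, so the protocol cannot enforce anything at runtime; with it, isinstance() verifies only that the named members exist.

```python
from typing import Protocol, runtime_checkable

class Meta(Protocol):
    """Plain protocol: serves type checkers and documentation only."""
    def __getitem__(self, name: str) -> str: ...

@runtime_checkable
class CheckableMeta(Protocol):
    """Same shape, but usable with isinstance()."""
    def __getitem__(self, name: str) -> str: ...

class ThirdPartyMetadata:
    """A structurally compliant third-party implementation."""
    def __getitem__(self, name: str) -> str:
        return "value"

impl = ThirdPartyMetadata()

try:
    # A plain Protocol is not runtime-checkable: this raises TypeError.
    isinstance(impl, Meta)
except TypeError:
    print("plain Protocol rejects isinstance()")

# A runtime_checkable Protocol only checks that the members exist.
print(isinstance(impl, CheckableMeta))
```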

Or I can move the new methods to PackageMetadataV2(PackageMetadata), a child protocol.

Eww. I don't like that at all. That can't possibly be the best way to evolve a protocol. I remember from C APIs that they were very strict about evolution: public APIs like the Win32 API would evolve a call once, from CallName to CallNameEx, so that after a year or two the standard was for everyone to use the Ex calls.

As I think about it more, the V2 concept isn't horrible. It does actually model the evolution of the protocol.

A part of me wishes there was an abstraction between the underlying versions of the protocol and the current recommended protocol, e.g. PackageMetadataLatest except without the Latest wart.

I guess what bothers me the most is that the name is now responsible for two concerns: what the protocol is attempting to model, and which iteration it is.

Here's what I recommend - let's get this change released as drafted to importlib_metadata and CPython 3.13 alpha and see if there are any adverse effects and learn/adapt. I may be overthinking the concern.
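For reference, the child-protocol idea debated above can be sketched as follows. This is only an illustration of the shape under discussion, not the PR's actual code; `Ident` here is a simplified, hypothetical stand-in for the PR's identity type.

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass(frozen=True)  # frozen => hashable, so set[Ident] works
class Ident:
    name: Optional[str]
    email: Optional[str]

class PackageMetadata(Protocol):
    """The existing protocol, abridged to a single member."""
    def __getitem__(self, name: str) -> str: ...

class PackageMetadataV2(PackageMetadata, Protocol):
    """Child protocol: all V1 members plus the new derived properties."""
    @property
    def authors(self) -> "set[Ident]": ...
    @property
    def maintainers(self) -> "set[Ident]": ...
```

Implementations targeting the old protocol remain compliant with PackageMetadata, while code that needs the new properties can require PackageMetadataV2.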

Lib/importlib/metadata/_adapters.py (review comment, resolved)
Lib/importlib/metadata/_adapters.py (review comment, resolved)
orbisvicis (Author):

This change will need to be backported to importlib_metadata if it's not contributed to there first. It would be preferable to contribute it there first as the test suite there is more robust

I've started backporting the changes to importlib_metadata and the tests there have definitely identified some issues that I'll have to fix. It's mainly formatting (I'll have to process the changes through black), but the hypothesis test definitely runs slower than it should. In the meantime, let's proceed with the review of whatever I have available now.

Lib/test/test_importlib/identity.py (review comment, resolved)
@@ -0,0 +1,404 @@
import re
Member:

This one file increases the size of the test suite for importlib.metadata by ~50%. That's a lot of complexity for a very small aspect of the code. Since it's just the tests, I'm not too concerned, but it makes me wonder how much of this module should be factored out into another library.

At the very least, I'd like one or two docstrings explaining what this module does and an overview of its purpose/function.

Member:

On further consideration, I think it was a bad choice to add hypothesis tests late in the PR. It's unnecessary to the implementation and possibly overkill and should be considered separately. Please split the hypothesis tests into a separate review with a separate justification. Is it possible to get the "example" tests to work without all of this infrastructure?

orbisvicis (Author):

The hypothesis strategy is really clunky and complex, isn't it? But it did identify some edge cases that required me to rewrite both regexes, if you look at my commit history. I think that's justification enough. And believe it or not, the regexes are written to be simple and straightforward to understand. There's a lot of possible optimizations, such as loop unrolling, for which having this hypothesis strategy might eventually prove more useful.

Member:

Yeah, I guess that's a good justification. I do appreciate that the tests are noticeably enhanced by the adoption of hypothesis. Whether or not we split out the hypothesis concern, I'd definitely want to be sure it can be adopted upstream in importlib_metadata.

orbisvicis (Author):

I'll separate out the Hypothesis identities strategy as a separate PR for importlib_metadata.

Lib/test/test_importlib/test_metadata_api.py (review comment, resolved)
Lib/test/test_importlib/test_metadata_api.py (review comment, resolved)
Lib/importlib/metadata/_adapters.py (review comment, resolved)
@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

orbisvicis (Author):

I have made the requested changes; please review again.

@Zac-HD (Contributor) left a comment:

👋 just dropping in to offer some comments on the Hypothesis strategy here, since it's one of the more ambitious I've seen and has led to several issues, stack overflow questions, and a recent Hypothesis release. I hope the review will be useful whether or not the property-based tests are ultimately included in this PR!

Lib/test/test_importlib/identity.py (five review comments, resolved)

orbisvicis commented Sep 18, 2023

@Zac-HD Thanks for the feedback 👍. I've implemented everything but I'm still using function objects as categories just to reduce boilerplate. Unfortunately the tests now run more slowly, if I can trust that Hypothesis doesn't resume/shrink along the same slow path.

total=0; fail=0
for i in {0..100}; do
    ((total++))
    tox -- tests/test_api.py::MetadataAPITests
    status="$?"
    echo -e "\n"
    if ((status != 0)); then
        ((fail++))
    fi
done
printf "Fail:  %s\nTotal: %s\n" "$fail" "$total"

Failure rates:

  • At the beginning: 90+%
  • After removing custom probabilities: 43%
  • Now: 100% (edit: after deleting __pycache__, .hypothesis and .tox directories, failure rate is still 98%)

Summary of changes:

  • Work with strategies rather than data.
  • Draw directly from st.lists rather than drawing list length from st.integers.
  • Switch to draw(st.booleans()).
  • Move all functions to module scope (remove nesting).
  • Pass state via function parameters.

Commits:

* Add `PackageMetadata.authors` and `PackageMetadata.maintainers` to the
  `importlib.metadata` module. These unify and provide minimal parsing for
  the respective core metadata fields ("Author", "Author-email") and
  ("Maintainer", "Maintainer-email").
* Correct the parsing of name/email entries in the 'Author-email' and
  'Maintainer-email' fields of the core metadata specification.
* Add a hypothesis strategy for generating structured core metadata and
  equivalent unstructured text, ensuring that parsing the text using
  PackageMetadata results in the same structure - a roundtrip test.
  Includes a `note` stub which does nothing.
* Correct the generation of identity entries in the core metadata
  specification.
* Provide a name and docstring that better explain the test.
* Fix doctest string escapes.
* Ignore Ruff E501 (line-too-long) for a long multiline string used as a
  test argument.
* Manipulate strategies rather than data - functions now return
  strategies instead of data.
* Remove nesting by moving all functions to module scope.
* Pass state via function parameters rather than using nonlocal
  variables.
* Draw directly from 'st.lists' rather than drawing list length from
  'st.integers'.
* Switch to 'draw(st.booleans())'.
* Remove 'merge_char_opts', which was over-engineered.
* Update for the renamed Hypothesis arguments:
  'blacklist_categories' -> 'exclude_categories' and
  'blacklist_characters' -> 'exclude_characters'.
@Zac-HD (Contributor) left a comment:

I suspect that much of the remaining failure rate is due to the modification of already-drawn structures. If you take that out, it might be faster? I'm definitely just guessing here though and would be pretty close to grabbing a profiler myself.
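To make the suggestion concrete, here is a hedged sketch (not the PR's code) of folding a post-draw modification, such as the lstrip step, into the strategy itself with `.map()`, so that Hypothesis tracks one canonical value per draw instead of a drawn value that the test then mutates.

```python
from hypothesis import strategies as st

entries = st.lists(st.text(min_size=1), min_size=1)

# Instead of mutating after the draw:
#     xs = draw(entries)
#     xs = [x.lstrip() for x in xs]
# derive a new strategy, so shrinking and replay see the final form:
stripped_entries = entries.map(lambda xs: [x.lstrip() for x in xs])
```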

Comment on lines +98 to +99
# if simple_chars, smaller search space and easier visualization
exclude_categories=("C",) if simple_chars else ("Cs", "Co", "Cn"),
@Zac-HD (Contributor):

I'd only support the simple case, since having two distinct options makes the database less effective and I don't expect control-character-specific bugs to be common, or for that matter worth fixing. It's a lot of state-management code for very little return.

orbisvicis (Author):

I like having the option to test every possible combination, but I'll make simple_chars the default.

Lib/test/test_importlib/identity.py (four review comments, resolved)

orbisvicis commented Sep 19, 2023

I suspect that much of the remaining failure rate is due to the modification of already-drawn structures. If you take that out, it might be faster? I'm definitely just guessing here though and would be pretty close to grabbing a profiler myself.

I'm not sure about that since I forgot to draw from unbalance, which is the main source of modifications. The other source, lstrip, doesn't do very much. What's interesting is that the performance decreased after my last batch of changes; that's got to be because I'm working with strategies rather than data, right?

Anyway, the profile doesn't show anything interesting, though it does slow down the test enough that I can't get any successes. At this point I'm ready to chalk this up to "my slow computer". Besides, CPython runs Hypothesis in a profile that suppresses hypothesis.HealthCheck.too_slow, and for me the test finishes in 45 seconds maximum, which isn't too bad.

[nav] In [19]: # python3 -m cProfile -o profile.pytest.fail.log -m pytest test_api.py::MetadataAPITests
          ...: # python3 -m cProfile -o profile.unittest.fail.log -m unittest tests.test_api.MetadataAPITests.test_structured_identity -vv
          ...: import cProfile
          ...: import pstats
          ...: p = pstats.Stats("profile.pytest.fail.log")
          ...: #p = pstats.Stats("profile.unittest.fail.log")
          ...: #p.strip_dirs().sort_stats("cumulative").print_stats("identity")
          ...: p.strip_dirs().sort_stats("cumulative").print_callees("identity")
   Ordered by: cumulative time
   List reduced from 4272 to 44 due to restriction <'identity'>

Function                                   called...
                                               ncalls  tottime  cumtime
test_api.py:345(test_structured_identity)  ->      74    0.000    0.000  _adapters.py:223(__iter__)
                                                  222    0.000    0.000  _adapters.py:224(<genexpr>)
                                                   16    0.000    0.014  _adapters.py:311(authors)
                                                   16    0.000    0.001  _adapters.py:318(maintainers)
                                                   64    0.000    0.000  control.py:152(note)
                                                   16    0.000    0.054  core.py:762(test)
                                                    1    0.000    1.700  core.py:1234(wrapped_test)
                                                   16    0.000    0.037  test_api.py:334(metadata_from_text)
identity.py:203(identities_strategy)       ->      60    0.000    0.000  <string>:4(identity_entries)
                                                   60    0.000    1.543  data.py:921(draw)
                                                   75    0.000    0.000  identity.py:233(<genexpr>)
identity.py:251(identity_entries)          ->      60    0.000    0.000  <string>:4(unbalance)
                                                   60    0.000    0.005  core.py:238(lists)
                                                   60    0.000    1.496  data.py:921(draw)
                                                   60    0.001    0.003  identity.py:265(process)
                                                   60    0.000    0.000  identity.py:288(lstrip)
                                                   30    0.000    0.029  identity.py:353(name_entry)
                                                   30    0.000    0.003  identity.py:373(ident_entry)
                                                   89    0.000    0.000  {method 'append' of 'list' objects}
identity.py:404(ident_name_addr)           ->      16    0.000    0.000  <string>:4(ident_addr)
                                                   16    0.000    0.000  <string>:4(ident_name)
                                                   16    0.000    0.000  <string>:4(ident_name_addr_delim)
                                                   48    0.000    0.633  data.py:921(draw)
identity.py:428(ident_addr)                ->      26    0.000    0.000  <string>:4(ident_addr_domain)
                                                   26    0.000    0.000  <string>:4(ident_addr_local)
                                                   52    0.000    0.001  core.py:149(booleans)
                                                  104    0.001    0.599  data.py:921(draw)
                                                   73    0.000    0.000  {built-in method builtins.len}
                                                   26    0.000    0.000  {method 'endswith' of 'str' objects}
                                                   26    0.000    0.000  {method 'startswith' of 'str' objects}
identity.py:475(quote)                     ->     524    0.000    0.002  __init__.py:225(compile)
                                                  262    0.000    0.006  core.py:178(sampled_from)
                                                  524    0.002    0.373  data.py:921(draw)
                                                  262    0.001    0.016  identity.py:103(text_exclude)
                                                  524    0.003    0.004  {method 'sub' of 're.Pattern' objects}
identity.py:419(ident_name)                ->      28    0.000    0.000  <string>:4(merge_sub)
                                                   28    0.000    0.000  <string>:4(quote)
                                                   28    0.000    0.003  core.py:238(lists)
                                                   56    0.000    0.373  data.py:921(draw)
                                                   28    0.000    0.005  identity.py:487(ident_name_other)
                                                   28    0.000    0.000  strategies.py:391(__or__)
identity.py:361(name_value)                ->      30    0.000    0.000  <string>:4(merge_sub)
                                                   30    0.000    0.000  <string>:4(quote)
                                                   30    0.000    0.002  core.py:238(lists)
                                                   60    0.000    0.334  data.py:921(draw)
                                                   30    0.000    0.000  identity.py:39(from_empty)
                                                   30    0.000    0.004  identity.py:504(name_value_other)
                                                   30    0.000    0.000  strategies.py:391(__or__)
identity.py:458(ident_addr_local)          ->      26    0.000    0.000  <string>:4(merge_sub)
                                                   26    0.000    0.000  <string>:4(quote)
                                                   26    0.000    0.003  core.py:238(lists)
                                                   52    0.000    0.303  data.py:921(draw)
                                                   26    0.000    0.007  identity.py:492(ident_addr_local_other)
                                                   26    0.000    0.000  strategies.py:391(__or__)
identity.py:465(ident_addr_domain)         ->      26    0.000    0.000  <string>:4(merge_sub)
                                                   26    0.000    0.000  <string>:4(quote)
                                                   26    0.000    0.003  core.py:238(lists)
                                                   52    0.000    0.263  data.py:921(draw)
                                                   26    0.000    0.006  identity.py:499(ident_addr_domain_other)
                                                   26    0.000    0.000  strategies.py:391(__or__)
identity.py:393(ident_addr_only)           ->      10    0.000    0.000  <string>:4(ident_addr)
                                                    3    0.000    0.000  <string>:4(ident_name_addr_delim)
                                                   10    0.000    0.000  core.py:149(booleans)
                                                   23    0.000    0.232  data.py:921(draw)
                                                   10    0.000    0.000  identity.py:39(from_empty)
identity.py:386(ident_name_only)           ->      12    0.000    0.000  <string>:4(ident_name)
                                                   12    0.000    0.155  data.py:921(draw)
                                                   12    0.000    0.000  identity.py:39(from_empty)
identity.py:115(merge_sub)                 ->     198    0.000    0.000  <string>:2(__eq__)
                                                   73    0.000    0.000  <string>:4(sub_entry)
                                                  117    0.001    0.032  data.py:921(draw)
                                                  110    0.000    0.002  identity.py:123(keyfunc)
                                                  311    0.000    0.000  {built-in method builtins.len}
                                                  157    0.000    0.000  {method 'append' of 'list' objects}
                                                  308    0.000    0.002  {method 'extend' of 'list' objects}
                                                  157    0.000    0.002  {method 'join' of 'str' objects}
identity.py:412(ident_name_addr_delim)     ->      19    0.000    0.000  core.py:767(text)
                                                   19    0.000    0.029  data.py:921(draw)
                                                   19    0.000    0.000  {method 'endswith' of 'str' objects}
identity.py:353(name_entry)                ->      30    0.000    0.000  <string>:4(name_value)
                                                   30    0.000    0.028  identity.py:357(name_empty)
                                                   30    0.000    0.000  strategies.py:391(__or__)
identity.py:357(name_empty)                ->      30    0.000    0.028  core.py:1063(builds)
                                                   60    0.000    0.001  misc.py:57(just)
identity.py:1(<module>)                    ->      16    0.000    0.027  core.py:1799(composite)
                                                    1    0.000    0.001  dataclasses.py:1202(dataclass)
                                                    2    0.000    0.000  {built-in method builtins.__build_class__}
identity.py:103(text_exclude)              ->     372    0.001    0.009  core.py:767(text)
                                                  372    0.001    0.016  identity.py:91(characters_exclude)
identity.py:91(characters_exclude)         -> 
identity.py:146(sub_entry)                 ->      73    0.000    0.001  __init__.py:178(sub)
                                                   73    0.000    0.006  identity.py:91(characters_exclude)
                                                   73    0.000    0.000  identity.py:111(generate_draw_function)
identity.py:492(ident_addr_local_other)    ->      26    0.000    0.003  core.py:1063(builds)
                                                   26    0.000    0.003  identity.py:103(text_exclude)
                                                   26    0.000    0.000  misc.py:57(just)
identity.py:499(ident_addr_domain_other)   ->      26    0.000    0.003  core.py:1063(builds)
                                                   26    0.000    0.003  identity.py:103(text_exclude)
                                                   26    0.000    0.000  misc.py:57(just)
identity.py:487(ident_name_other)          ->      28    0.000    0.003  core.py:1063(builds)
                                                   28    0.000    0.002  identity.py:103(text_exclude)
                                                   28    0.000    0.000  misc.py:57(just)
identity.py:504(name_value_other)          ->      30    0.000    0.002  core.py:1063(builds)
                                                   30    0.000    0.002  identity.py:103(text_exclude)
                                                   30    0.000    0.000  misc.py:57(just)
identity.py:154(sub_name)                  ->      98    0.000    0.001  __init__.py:178(sub)
                                                   44    0.000    0.001  __init__.py:225(compile)
                                                   44    0.000    0.002  identity.py:91(characters_exclude)
                                                   44    0.000    0.000  identity.py:111(generate_draw_function)
                                                   44    0.000    0.000  {method 'sub' of 're.Pattern' objects}
identity.py:123(keyfunc)                   ->     506    0.002    0.003  dataclasses.py:1453(replace)
identity.py:373(ident_entry)               ->      30    0.000    0.003  identity.py:382(ident_empty)
                                                   90    0.000    0.000  strategies.py:391(__or__)
identity.py:382(ident_empty)               ->      30    0.000    0.002  core.py:1063(builds)
                                                   60    0.000    0.001  misc.py:57(just)
identity.py:265(process)                   ->     124    0.000    0.001  identity.py:56(as_text)
                                                  124    0.001    0.001  identity.py:64(as_data)
                                                   68    0.000    0.000  identity.py:281(<genexpr>)
                                                  124    0.000    0.000  {built-in method builtins.all}
                                                   68    0.000    0.000  {method 'add' of 'set' objects}
identity.py:129(<genexpr>)                 ->     184    0.000    0.000  <string>:2(__eq__)
                                                  184    0.000    0.001  identity.py:123(keyfunc)
identity.py:64(as_data)                    ->     248    0.000    0.000  identity.py:72(keyfunc)
                                                  248    0.000    0.000  {method 'append' of 'list' objects}
                                                  248    0.000    0.001  {method 'join' of 'str' objects}
identity.py:39(from_empty)                 ->     324    0.000    0.000  {method 'append' of 'list' objects}
identity.py:56(as_text)                    ->     124    0.000    0.000  {method 'join' of 'str' objects}
identity.py:79(<genexpr>)                  ->     595    0.000    0.000  identity.py:72(keyfunc)
<string>:4(identity_entries)               ->      60    0.000    0.000  core.py:1784(accept)
                                                   60    0.000    0.000  utils.py:123(accept)
identity.py:62(<genexpr>)                  -> 
identity.py:72(keyfunc)                    -> 
identity.py:288(lstrip)                    ->      83    0.000    0.000  {method 'lstrip' of 'str' objects}
identity.py:281(<genexpr>)                 -> 
identity.py:111(generate_draw_function)    -> 
identity.py:233(<genexpr>)                 -> 
argparse.py:1789(identity)                 -> 
identity.py:31(Entry)                      -> 
identity.py:18(Token)                      -> 


jaraco commented Sep 21, 2023

After further consideration, I realized that maybe this feature shouldn't be implemented in importlib.metadata at all. In prior decisions (e.g. python/importlib_metadata#429 (comment)), the team has decided to make importlib metadata a low-level reflector of the metadata in a fairly raw form and defer interpretation and more sophisticated objects to a third-party library (namely packaging). Another example of this strategy is in how version() returns a primitive string instead of a packaging.version.Version.

If we wish to remain consistent in our approach, we should probably consider contributing this behavior (interpretation of Author/Maintainer contacts) to packaging or some other third-party library. Imagine for example:

from importlib.metadata import metadata
from packaging.contacts import parse_authors

authors = parse_authors(metadata('some-package'))

I regret not having made this connection earlier, as I know the contributor has invested a lot in this change.

I'll be honest, I don't love this strategy because of how it requires the callers to draw in the dependency and separately wrap the calls to arrive at high-level objects.

It does, however, have the advantage of keeping the low-level implementation simple and decoupling the more sophisticated objects into a single, third-party library.

In light of these concerns and past decisions, can we make the case that idents/contacts should get special treatment in this regard?

orbisvicis (Author):

Why does packaging have its own metadata parser?

orbisvicis (Author):

I have roughly the same objections as @jaraco, but I also don't want to be stuck between a refactoring of importlib.metadata and packaging, as there seems to be some overlapping functionality. I've also spent more time than I should on this, so I'd appreciate if the packaging maintainers could provide some feedback here and now regarding backporting this to packaging - @pradyunsg and @brettcannon (?). I also feel that Python should be able to parse its own metadata without drawing in third-party dependencies.

Class packaging.metadata.RawMetadata is just another low-level reflector of the metadata with just some light touch-ups - _parse_keywords and _parse_project_urls. It is just a TypedDict providing a 1:1 correspondence with core metadata fields and so isn't suitable for the additional properties I am proposing.

Class packaging.metadata.Metadata provides a much higher-level interface through _Validator conversion functions, but it currently only offers a 1:1 correspondence with core metadata fields. Since I am proposing a combined interface to multiple fields, I am not sure how my proposal will be received by packaging developers. The Metadata class also appears to be a work in progress - not yet in the pypa documentation.

As I said, I'm not pleased with third-party dependencies, but the resulting workflow is only slightly more convoluted:

With this PR:

from importlib.metadata import packages_distributions, metadata

# The module name is more stable than the package name
pkg_metadata = metadata(packages_distributions()["some-importname"][0])
pkg_authors = pkg_metadata.authors

As properties on packaging.metadata.Metadata:

from importlib.metadata import packages_distributions, metadata
from packaging.metadata import Metadata

# The module name is more stable than the package name
pkg_metadata = Metadata.from_email(
    metadata(packages_distributions()["some-importname"][0])
)
authors = pkg_metadata.authors

Which would hook into a new module on packaging:

from importlib.metadata import packages_distributions, metadata
from packaging.identity import authors

# The module name is more stable than the package name
pkg_metadata = metadata(packages_distributions()["some-importname"][0])
pkg_authors = authors(pkg_metadata)

In light of these concerns and past decisions, can we make the case that idents/contacts should get special treatment in this regard?

That's a question for the team, but I'm noting that python/importlib_metadata#429, which became pypa/packaging#701, is still open.

@brettcannon
Member

I also feel that Python should be able to parse its own metadata without drawing in third-party dependencies.

To be clear, this is wheel metadata, not Python's own metadata (which is primarily reflected in the sys module). And technically Python doesn't care about wheel metadata in order to import something; if you left out the .dist-info directory then things keep working.

Why does packaging have its own metadata parser?

Because packaging has very different requirements in terms of Python version compatibility and release cadences that don't align with the stdlib.

Class packaging.metadata.RawMetadata is just another low-level reflector of the metadata with just some light touch-ups - _parse_keywords and _parse_project_urls. It is just a TypedDict providing a 1:1 correspondence with core metadata fields and so isn't suitable for the additional properties I am proposing.

Correct.

Class packaging.metadata.Metadata provides a much higher-level interface through _Validator conversion functions, but it currently only offers a 1:1 correspondence with core metadata fields.

Also correct.

Since I am proposing a combined interface to multiple fields, I am not sure how my proposal will be received by packaging developers.

We're happy with the current design, which went through multiple discussions over several months, as it focuses more on a simple API and validation than a heavily ergonomic, opinionated view of the metadata (i.e., it's meant to very much map directly to the underlying core metadata format). So I think providing some higher-level API would be best kept in a 3rd-party package.

The Metadata class also appears to be a work in progress - not yet in the pypa documentation.

It's done, it just hasn't been released yet (should be happening in the near future).

@orbisvicis
Author

orbisvicis commented Sep 22, 2023

@jaraco, do you want it as a new module in importlib.metadata? Otherwise I'll put it in one of my packages. But I think there's some benefit to minimizing the number of packages.

a heavily ergonomic, opinionated view of the metadata

The core metadata specification is sufficiently ambiguous that interpretation is required. Following my proposal at the Python packaging discourse, Structure for importlib metadata identities, I've tried to make as few assumptions as possible to maximize the space of inputs that can be correctly parsed.

To be clear, this is wheel metadata, not Python's own metadata

Aren't we striving for a world where the two are indistinguishable?

@jaraco
Member

jaraco commented Sep 22, 2023

So I think providing some higher-level API would be best kept in a 3rd-party package.

Ugh. What a mess we've gotten into. We've got two low-level metadata providers (importlib metadata and packaging) and one high-level provider (packaging), yet packaging isn't high-level enough to provide any ergonomic features, so we need yet another third-party package just for the ergonomics?

To be sure, packaging does provide some high-level ergonomics, like interpreting the meaning of a Version or Requirement or Specification, just not ergonomics that cross two fields.

Now that I see that packaging isn't a good home for this functionality, I'm more inclined to accept it within importlib metadata, so let's proceed with finding a way to accept it here. I don't want to be complicit in adding to the complication of the packaging ecosystem.

@jaraco, do you want it as a new module in importlib.metadata?

I'm unconcerned about the specific organization within importlib metadata. I do still want to be sure that this is contributed to importlib_metadata first (or at least simultaneously so that the porting work isn't left lingering).

To be clear, this is wheel metadata, not Python's own metadata

Aren't we striving for a world where the two are indistinguishable?

I don't think there's any plans to converge these two as they're unrelated domains of concern. Python's metadata doesn't have anything to do with packages (even in the stdlib).

To be sure, importlib_metadata and importlib.metadata are meant to be indistinguishable, and both deal with wheel metadata. I don't think Python metadata is relevant to this discussion.

@orbisvicis
Author

orbisvicis commented Sep 22, 2023

I do still want to be sure that this is contributed to importlib_metadata first (or at least simultaneously so that the porting work isn't left lingering).

There's an equivalent, flattened PR already waiting to run at importlib_metadata: python/importlib_metadata#471. I might have to replace usage of the parameterized package (in this PR) if we're not committing the Hypothesis tests to CPython, but aside from that it's largely ready.

I'm more inclined to accept it within importlib metadata, so let's proceed with finding a way to accept it here.

I've noticed that package metadata might not be a solved problem, what with the multiple interfaces, multiple packages, and multiple PRs, but I think I have an interesting proposal that preserves PackageMetadata as a low-level reflector of metadata. We could create a provisional child class StructuredPackageMetadataV1(PackageMetadata) that presents an ergonomic interface for higher-level parsing functionality, along with an accompanying structured_metadata() function. My changes would go there, and it's my hope that this class will become a grab-bag of various metadata parsing proposals with a lower bar of entry for pull requests. I think some of the best interfaces are those that grow organically within a limited scope, and this approach facilitates that. After a few years, importlib_metadata can cull the class, keeping only the popular methods in a finalized StructuredPackageMetadataV2 class.
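
A hypothetical sketch of the provisional class described above. The names (Ident, StructuredPackageMetadataV1, structured_metadata) follow the proposal, but the parsing details here are illustrative only, not the PR's actual code:

```python
# Hypothetical sketch of the proposed provisional class. Parsing details
# are illustrative, not the PR's implementation.
from email.message import Message
from email.utils import getaddresses
from typing import NamedTuple, Optional


class Ident(NamedTuple):
    name: Optional[str]
    email: Optional[str]


class StructuredPackageMetadataV1:
    """Ergonomic, higher-level view over low-level package metadata."""

    def __init__(self, metadata: Message):
        self._metadata = metadata

    @property
    def authors(self) -> "list[Ident]":
        # Parse the RFC 5322-style address list in "Author-email",
        # preserving order while dropping exact duplicates.
        fields = self._metadata.get_all("Author-email") or []
        idents = []
        for name, addr in getaddresses(fields):
            ident = Ident(name or None, addr or None)
            if ident not in idents:
                idents.append(ident)
        return idents


def structured_metadata(metadata: Message) -> StructuredPackageMetadataV1:
    return StructuredPackageMetadataV1(metadata)
```

Under this sketch, `structured_metadata(metadata("some-package")).authors` would return an ordered list of Ident tuples.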

Then there's the opposite approach, which isolates my PR as much as possible in a separate module. I'm not a fan of this approach, but however it happens, I do need the functionality for my own tooling.

from importlib.metadata import packages_distributions, metadata
from importlib.metadata.structure import authors

pkg_metadata = metadata(packages_distributions()["some-importname"][0])
pkg_authors = authors(pkg_metadata)

@brettcannon
Member

The core metadata specification is sufficiently ambiguous that interpretation is required.

Then let's work towards fixing the ambiguity in the specs instead of trying to come up with our own interpretation via code. That's what leads to conventions being assumed to be standardized.

@orbisvicis
Author

I'm not sure that's a good idea. As I see it, the core metadata has several shortcomings:

  • a poorly specified, custom format
  • a design inspired by the overly-complicated email format
  • the email format's issues with internationalization and encoding
  • architectural issues such as similar keys with overlapping usages

If it were up to me I'd replace the custom format with TOML or JSON or some more structured format and deduplicate keys such as "Author", "Author-email" and "Maintainer", "Maintainer-email". As I understand it, there have been various PRs to enhance the ergonomics of the metadata API, so a large part of this project would be to collect feedback from the community.

In 2021 there was a discussion [1] about the encoding of non-ASCII names and emails, in which the core metadata identity fields "{name} <{email}>" were described as "intended to only be a guide, not that it should be the exact thing to use when formatting the field". That led to a PR against the specification [2], active to this day, describing the core metadata as "exactly which email RFC applies to packaging metadata is not specified" and noting that "the need for tools to work with years worth of existing packages makes it difficult to shift to a new format" (referring to PEP 566 from 2017, which describes a JSON-compatible version of the core metadata).

This prompted a push [3] to better specify the existing core metadata format, which in turn pushed packaging to implement a metadata API [4] and eventually led to PEP 685, which "specifies how to normalize distribution extra names when performing comparisons" - a relatively narrow focus. As far as I can tell, a formal specification of the entire format in EBNF notation was never finalized.

Meanwhile, the current toolchain (2023) processes multi-line identity fields in pyproject.toml to produce METADATA files that importlib.metadata cannot parse. This was caught by my Hypothesis test, and I'm not sure whether it's a failure of the metadata specification or of the toolchain, given the complexity of the email format (I've simply blacklisted newlines from the test). I wasn't aware of packaging at the time, but I'd be curious to see how its metadata parser performs.

And let's not forget PEP 426 [5], from all the way back in 2012, which aimed to replace the core metadata with a JSON format but was abandoned due to "no feasible migration plan". The original metadata specification (PEP 241) was created in 2001, and I'm sure I've only scratched the surface when researching the history of proposals intended to fix its shortcomings. From my perspective not much has changed, and throwing my hat in the ring would exceed the scope of what I am able to provide - especially since I would propose an interface that merges or removes certain keys, which is more radical by far.

That said, I'd suggest the core metadata specification is replaced sooner rather than later, before the technical debt of all published Python packages becomes insurmountable.

[1] https://discuss.python.org/t/core-metadata-email-fields-unicode/7421
[2] https://github.com/pypa/packaging.python.org/pull/851/files
[3] https://discuss.python.org/t/python-metadata-format-specification-and-implementation/7550
[4] pypa/packaging#383
[5] https://peps.python.org/pep-0426/
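
The multi-line identity fields mentioned above can be reproduced with the stdlib email parser that importlib.metadata builds on. This sketch (header value borrowed from the core metadata spec's Author example) shows how the legacy and modern parsing policies differ:

```python
# Sketch: how the stdlib email parser, which importlib.metadata builds on,
# handles a folded (multi-line) Author header.
import email
import email.policy

raw = (
    "Author: C. Schultz, Universal Features Syndicate,\n"
    "        Los Angeles, CA <cschultz@peanuts.example.com>\n"
    "\n"
)

# The legacy compat32 policy (the default) keeps the raw folded value,
# embedded newline included - which downstream consumers rarely expect.
legacy = email.message_from_string(raw)
print(repr(legacy["Author"]))

# The modern default policy unfolds continuation lines into one logical value.
modern = email.message_from_string(raw, policy=email.policy.default)
print(repr(str(modern["Author"])))
```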

@brettcannon
Member

FYI I'm unsubscribing as I'm not really seeing a need for any further input from me.

* Document `PackageMetadata.authors` and `PackageMetadata.maintainers`
  in the usage guide.
* Export `Ident` so it can be documented but also used to build custom
  parsing strategies.
@dstufft
Member

dstufft commented Sep 27, 2023

I mentioned it on discuss.p.o, but for whatever it's worth, I think we should not be adding APIs to these core utilities like importlib.metadata or packaging that make assumptions about the content of the metadata beyond what the spec itself says.

If the spec isn't good enough, then we should improve the spec, not paper over it in code.

Properties `PackageMetadata.authors` and `PackageMetadata.maintainers`
are now lists with repeated elements removed, rather than sets.
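
The order-preserving de-duplication this commit note describes can be sketched with dict.fromkeys, which relies on dicts preserving insertion order (Python 3.7+); the actual PR implementation may differ:

```python
# Sketch: order-preserving de-duplication via dict.fromkeys. Dicts keep
# insertion order, so the first occurrence of each element survives.
def unique(items):
    return list(dict.fromkeys(items))

print(unique(["a", "b", "a", "c"]))
# ['a', 'b', 'c']
```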
@jaraco
Member

jaraco commented Dec 14, 2023

Since there seems to be widespread consensus that importlib.metadata should reflect the spec, I took a look at the spec for Author and Author-email, which gives these examples:

Author: C. Schultz, Universal Features Syndicate,
        Los Angeles, CA <cschultz@peanuts.example.com>
Author-email: "C. Schultz" <cschultz@example.com>
Author-email: cschultz@example.com, snoopy@peanuts.com

Assuming the latter two are combined as:

Author-email: "C. Schultz" <cschultz@example.com>, cschultz@example.com, snoopy@peanuts.com

How would .authors be interpreted for those values? Are these cases (or something similar) included in the test suite? If so, I don't see where.

Based on the feedback above, if this proposed change provides a meaningful interpretation of what the spec prescribes, it's worth implementing, but it should avoid papering over ambiguity in the spec or otherwise creating a de facto spec by creating implicit constraints beyond what the spec provides.
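
For reference, this is how the stdlib RFC 5322 address parser splits the combined Author-email value quoted above; it's only the stdlib baseline, not necessarily how the PR's property interprets it:

```python
# Sketch: the stdlib address parser's reading of the combined field.
# Bare addresses come back with an empty display name.
from email.utils import getaddresses

combined = (
    '"C. Schultz" <cschultz@example.com>, '
    "cschultz@example.com, snoopy@peanuts.com"
)
pairs = getaddresses([combined])
print(pairs)
# [('C. Schultz', 'cschultz@example.com'), ('', 'cschultz@example.com'),
#  ('', 'snoopy@peanuts.com')]
```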

@jaraco
Member

jaraco commented Mar 7, 2024

If it were up to me I'd replace the custom format with TOML or JSON or some more structured format

What about trying not to bite off so much and instead specify two fields, "Authors" and "Maintainers" with a specified structure that supports the desired use-cases. Or consider a "Contributor" field that contains any number of entities with their associated role (author, maintainer, etc). Is there anything about the email header format/syntax that would prohibit supplying all possible values?

Still, I understand if you're reluctant to go down that path and I'm comfortable allowing importlib.metadata to facilitate access to the aspects already specified.

How would .authors be interpreted for those values? Are these cases (or something similar) included in the test suite? If so, I don't see where.

Based on the feedback above, if this proposed change provides a meaningful interpretation of what the spec prescribes, it's worth implementing, but it should avoid papering over ambiguity in the spec or otherwise creating a de facto spec by creating implicit constraints beyond what the spec provides.

Have you thought any more about these questions and concerns? I feel like they're the primary blockers against accepting this change.

@jaraco jaraco self-assigned this Apr 19, 2024
@jaraco
Member

jaraco commented Apr 19, 2024

In python/importlib_metadata#471 (comment), I'm diving into the details of this proposal and have identified some unexpected behaviors.
