
Provide mapping from "Python packages" to "distribution packages" #131

Closed · jaraco opened this issue Oct 22, 2020 · 7 comments · Fixed by #287

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

jaraco commented Oct 22, 2020

In GitLab by @8day on Oct 16, 2020, 11:12

Alongside Distribution.from_name(), there should be a Distribution.from_package_name(). For example, pkg_resources (shipped by setuptools) cannot be found with importlib_metadata.Distribution.from_name('pkg_resources'). Another example is PyYAML (the distribution name), which contains yaml (the package name).

Most likely, for this to work, distributions will need Provides-Dist: {dist}:{pkg} defined in their metadata. Since nobody seems to use a proper solution at the moment, it could be implemented as a hack: check whether *.dist-info/top_level.txt exists and *.dist-info/INSTALLER contains pip, then use the contents of *.dist-info/top_level.txt to list the "import packages" contained in the "distribution package".
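A minimal sketch of that hack, assuming the unofficial top_level.txt and INSTALLER files that pip writes for setuptools-built distributions (the helper names here are hypothetical, not part of any API):

```python
import importlib.metadata as md


def top_level_packages(installer_text, top_level_text):
    """Pure decision logic for the hack: trust top_level.txt only
    when INSTALLER says the distribution was installed by pip."""
    if (installer_text or "").strip() == "pip" and top_level_text:
        return top_level_text.split()
    return []


def import_packages(dist_name):
    """Best-effort list of import packages for an installed distribution.
    read_text() returns None when a file is absent, which the pure
    helper above treats as "no information"."""
    dist = md.distribution(dist_name)
    return top_level_packages(
        dist.read_text("INSTALLER"), dist.read_text("top_level.txt")
    )
```

Splitting the file-reading from the decision logic keeps the heuristic testable without a real installed environment.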

This would require a Package class to be added, which will complicate things quite a bit: e.g., Package.files() would have to return the files stored within the package rather than the entire distribution. It would also require an adjustment of terminology, since at the moment "package" is treated as more or less the same as "distribution package".

Note that this would require reading the metadata of every discovered distribution, which would be extremely inefficient.

All of this could be avoided by switching to another, mono-package distribution format (and preferably metadata format), but that's a topic for another discussion...

Edit:
I could swear I've read about notation like Provides-Dist: {dist}:{pkg} in one of the PEPs, but I can't find any sources...

jaraco commented Oct 22, 2020

In GitLab by @8day on Oct 16, 2020, 13:33

changed the description

jaraco commented Jan 10, 2021

Thanks for the report @8day. Indeed, when I started implementing this package, my instinct was the same as yours, that it should be straightforward for a (Python) package to be able to resolve itself to its distribution, or conversely to resolve from a distribution all the packages it exposes.

But soon after delving into the implementation, I abandoned the idea of attempting to solve the problem of resolving distributions to their packages and vice-versa and instead focused on enabling the basic use-case of resolving metadata for a (usually installed) distribution.

Therefore, a distribution like PyYAML, which exposes a yaml module, would still need to use metadata.distribution('PyYAML') to get metadata for that distribution package. This project does not try to provide a mapping.

Note that distributions that expose only a single module are affected as well (path.py <-> the path module).

In an early design, I had proposed that Python packages that wished to declare their distribution package could do so with something like:

__distribution__ = 'PyYAML'

Or for something like pkg_resources/__init__.py:

__distribution__ = 'setuptools'

But as I mentioned above, I decided the value this would add was small compared to the confusion it would introduce.


I do agree it would be nice to have a reliable protocol to determine the packages/modules for a distribution and the relevant distributions for a module/package. Note that with namespace packages, it's a many-to-many relationship: the module backports could map to configparser, backports.pdb, and more, depending on the install.

This effort isn't something this project has plans to tackle, but if you have interest in driving the design, consensus, and implementation, I'd be willing to advise on the process.

I'd start by providing more background on what use-cases are unmet by the current implementation. Can you elaborate on what use-cases you encountered that inspired you to write the report?

@jaraco jaraco changed the title No way to detect "import packages" in multi-package "distribution packages" Provide mapping from "Python packages" to "distribution packages" Jan 10, 2021
@jaraco jaraco added enhancement New feature or request help wanted Extra attention is needed labels Jan 10, 2021
8day commented Jan 12, 2021

Can you elaborate on what use-cases you encountered that inspired you to write the report?

I think I faced this problem while designing a dependency checker. Users were forced to specify an entire Distribution Package as a dependency when it was an Import Package that was actually required, which was confusing. That is, the difference between the "import name"/"name in code" and the "install name" could require analysis of the installed files. Note that this mattered only because it had to be a fully automated task, without human oversight.

I can't get this out of my head: it seems like all of this is really about "Import Package metadata". E.g., __name__ and other import-related module attributes, __version__, and __doc__ imply that "Import Package metadata" already exists; it just remains nameless, and some missing bits must be filled in with Distribution Package metadata.

8day commented Jan 13, 2021

I think we are trying to solve a problem that has already been solved once. I've re-read the metadata PEPs and finally saw why I thought of Provides-Dist: {dist}:{pkg} (I guess I was looking for that exact notation and didn't read the text properly):

(from PEP 345 Provides-Dist)

Each entry contains a string naming a Distutils project which is contained within this distribution. This field must include the project identified in the Name field, followed by the version : Name (Version).

A distribution may provide additional names, e.g. to indicate that multiple projects have been bundled together. For instance, source distributions of the ZODB project have historically included the transaction project, which is now available as a separate distribution. Installing such a source distribution satisfies requirements for both ZODB and transaction.

A distribution may also provide a "virtual" project name, which does not correspond to any separately-distributed project: such a name might be used to indicate an abstract capability which could be supplied by one of multiple projects. E.g., multiple projects might supply RDBMS bindings for use by a given ORM: each project might declare that it provides ORM-bindings, allowing other projects to depend only on having at most one of them installed.

As can be seen, Provides-Dist does indeed allow Import Packages to be specified, though not as {dist}:{pkg} but simply as {pkg}. There's also a solution for namespace packages and standalone scripts: the format would look like {super_module}.{module}. (Also, in case some people don't know this, it shows that Provides-Dist inspired the creation of Provides-Extra, to which this feature was moved.) This is even supported by the example shown for Requires-Dist: Requires-Dist: zope.interface (>3.5.0). It is more evident if you go as far back as Provides:

Each entry contains a string describing a package or module that will be provided by this package once it is installed. These strings should match the ones used in Requirements fields. A version declaration may be supplied (without a comparison operator); the package's version number will be implied if none is specified.

Note that Provides allowed a version to be defined for individual Import Packages, or to be "implied" (most likely taken from the Distribution Package's version defined under Version), which covers a very important case.

I.e., what must be done is for package managers to support all of this. Thus, there's probably no need even for a new protocol/PEP: all of this just has to be implemented by importlib.metadata.

I think the metadata format needs proper analysis to clear up a lot of murky points and set it back on its original path.

Edit:
I guess this shows that the logic creating top_level.txt must be moved from pip to setuptools to properly define the metadata.

Edit 2:
Also, this would allow handling cases like PyYAML, where the Distribution Package name differs from the Import Package name: Name: PyYAML ... Provides-Dist: yaml.

P.S. Although the Dist in the field names makes all of this kind of confusing.
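For reference, these fields can be inspected programmatically: importlib.metadata exposes a distribution's metadata as an email.message.Message, so multi-use fields like Provides-Dist come back via get_all(). A sketch over a hand-written metadata blob (the field values follow the PyYAML example above and are illustrative, not taken from a real release):

```python
from email.parser import Parser

# Illustrative metadata, in the same RFC 822 syntax that
# importlib.metadata parses out of *.dist-info/METADATA.
raw = """\
Metadata-Version: 2.1
Name: PyYAML
Version: 6.0
Provides-Dist: yaml
"""

meta = Parser().parsestr(raw)
print(meta["Name"])                   # PyYAML
print(meta.get_all("Provides-Dist"))  # ['yaml']
```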

jaraco commented Jan 13, 2021

Provides-Dist indeed allows to specify Import Packages

I think you may be mistaken on this point. The way I interpret the text above, it refers not to an "Import Package" but to a "Distribution Package" (a.k.a. a distutils project):

Each entry contains a string naming a Distutils project

So the values of Provides-Dist should be the names of distutils projects (as found on PyPI, or virtual ones not found on PyPI), but as far as I can tell it doesn't map to Import Packages.

This is more evident if you go as far back as Provides:

Each entry contains a string describing a package or module that will be provided by this package once it is installed.

I suspect that Provides would in fact implement the behavior you're seeking, providing a mapping from the Distribution Package to the "Import Packages" and modules presented by the distribution, but this field was present in Metadata 1.1 and was seemingly removed in Metadata 1.2 (apparently replaced by Provides-Dist, which specifies a Distutils project).

What's more important, however, is that Provides is no longer part of the spec, and Provides-Dist is among the rarely used fields, which are explicitly not honored by the packaging ecosystem, so I wouldn't expect either to have the behavior you desire. In my opinion, using Provides-Dist to serve the purpose of Provides from Metadata 1.1 would probably violate the spec as currently written and accepted.

Most importantly, there's nothing about the importlib_metadata project that has any implications here. importlib_metadata is mostly agnostic about the contents of the metadata and mainly serves as a mechanism to parse the syntax of metadata for programmatic consumption. It does minimal interpretation of the values of the metadata.


Thinking about your use-case, providing for the user a way to download the relevant distribution packages based on the Python packages they wish to import may prove to be a lot more challenging than just exposing metadata in the package. You'll also need support in the index to expose that metadata in a searchable way.

I guess if you're only looking to validate dependencies in a local environment, it might be possible to do without affecting the index. Still, I'm not sure you can achieve what you need. Consider, for example, the path module, which is exposed by multiple packages (path.py and path). If a project does import path, would you allow path.py or path as distribution packages to satisfy that dependency?

I guess this shows that logic creating top_level.txt must be moved from pip to setuptools to properly define metadata.

You may be on to something with top_level.txt. Have you tried using importlib_metadata.distribution(mypkg).read_text('top_level.txt').split()? Does that provide the mapping from mypkg that you need?

```
~ $ pip-run -q jaraco.functools jaraco.classes pyyaml -- -c "import importlib.metadata as md; dists = md.distributions(); pkg_to_dist = {pkg: dist.metadata['Name'] for dist in dists for pkg in dist.read_text('top_level.txt').split()}; import pprint; pprint.pprint(pkg_to_dist)"
{'_yaml': 'PyYAML',
 'argcomplete': 'argcomplete',
 'certifi': 'certifi',
 'click': 'click',
 'jaraco': 'jaraco.functools',
 'more_itertools': 'more-itertools',
 'packaging': 'packaging',
 'pip': 'pip',
 'pip-run': 'pip-run',
 'pip_run': 'pip-run',
 'pipx': 'pipx',
 'pyparsing': 'pyparsing',
 'six': 'six',
 'userpath': 'userpath',
 'yaml': 'PyYAML'}
```

There are some gaps there, in that items in top_level.txt don't uniquely map to a distribution (jaraco is supplied by both jaraco.functools and jaraco.classes).

Still, it does seem as if most of what you need is exposed through that (unofficial) metadata.
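Unrolled into a function, the same idea reads as below (a sketch only: top_level.txt is unofficial metadata, and read_text() returns None for distributions that don't ship the file, so that case needs guarding — the one-liner would raise AttributeError on such a distribution):

```python
import importlib.metadata as md


def pkg_to_dist():
    """Map each top-level import name to the distribution providing it.
    On collision (e.g. the shared 'jaraco' namespace), later
    distributions silently overwrite earlier ones."""
    mapping = {}
    for dist in md.distributions():
        # read_text() returns None when top_level.txt is absent.
        top_level = dist.read_text("top_level.txt") or ""
        for pkg in top_level.split():
            mapping[pkg] = dist.metadata["Name"]
    return mapping
```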

Note, I'm pretty sure that top_level.txt is already something that's prepared by Setuptools and not pip.

Given that Python has deprecated the declaration of namespace packages, you may have difficulty disentangling those names.

Does that help?

8day commented Jan 14, 2021

Most importantly, there's nothing about the importlib_metadata project that has any implications here.

Considering that I was mistaken about Provides-Dist, this is true. Both of my replies were about your __distribution__ attribute: I thought you might find them useful. In my first reply I assumed you were asking for help understanding other people's potential use-cases for __distribution__, and in the second I thought I had found a way to avoid a new protocol/PEP, and hence to make it easier to convince people to implement such a feature.

Speaking of __distribution__: seeing as you acknowledge the need for a mapping between "Python packages" and "distribution packages", I'll add a few more thoughts, as they seem related and at the moment there's no other place to share them. First, information about namespace packages and regular Import Packages should be part of the Distribution Package metadata, because that is the kind of information people might want to see alongside other project properties/metadata on PyPI (or at least it should be present in addition to __distribution__). Second, considering that at the moment there are two distribution formats, sdist and wheel, and neither allows multiple Distribution Packages, Provides-Dist and Obsoletes-Dist are obsolete; it may therefore make sense to revert to the original semantics of Provides, but in new fields like Provides-IP and Provides-NSP. A single Provides-Module could have been used instead of Provides-IP and Provides-NSP, with a namespace package implied by the presence of dots in its name, but since a dotted name is also a valid Distribution Package name, and because setuptools itself differentiates between the two with top_level.txt and namespace_packages.txt, two separate fields should be used.


I guess if you're only looking to validate dependencies in a local environment, it might be possible to do without affecting the index.

That's exactly what that "dependency checker" was supposed to do: just check whether a package was already installed. This was part of a feature where a build backend's extension was used only when a PEP 508 dependency was satisfied. E.g., you may want to run a different set of commands depending on the platform.

Does that help?

In my case? Yes. In general? Kind of. In your first reply you showed that with namespace packages in the picture, this is not the kind of problem importlib_metadata should, or even can, solve. Namespace packages make the idea of a common interface like Package.files() untenable -- it makes sense only for Distribution Packages. I.e., importlib_metadata should deal only with Distribution Packages.


I guess you can close this issue.

jaraco commented Feb 21, 2021

Sounds good. Thanks for the hard consideration.

I did find that it's possible to get all of the distribution names for a given top-level package:

```
$ pip-run -q jaraco.functools jaraco.classes pyyaml -- -c "import importlib.metadata as md, collections; dists = md.distributions(); pkg_to_dist = collections.defaultdict(list); [pkg_to_dist[pkg].append(dist.metadata['Name']) for dist in dists for pkg in dist.read_text('top_level.txt').split()]; import pprint; pprint.pprint(dict(pkg_to_dist))"
{'_yaml': ['PyYAML'],
 'argcomplete': ['argcomplete'],
 'autocommand': ['autocommand'],
 'certifi': ['certifi'],
 'click': ['click'],
 'dist': ['jaraco.classes', 'jaraco.functools', 'path', 'pip-run'],
 'examples': ['pip-run'],
 'jaraco': ['jaraco.classes', 'jaraco.functools'],
 'more_itertools': ['more-itertools'],
 'packaging': ['packaging'],
 'path': ['path'],
 'pip': ['pip'],
 'pip-run': ['pip-run'],
 'pip_run': ['pip-run'],
 'pipx': ['pipx'],
 'pyparsing': ['pyparsing'],
 'six': ['six'],
 'userpath': ['userpath'],
 'yaml': ['PyYAML']}
```

That also revealed an issue with jaraco.skeleton in its packaging (so disregard "dist" in that dict).
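The fix referenced above (#287) landed this mapping as a packages_distributions() function in importlib_metadata, which was later also shipped in the standard library's importlib.metadata (Python 3.10+). A hedged sketch of calling it, guarded because older releases lack the function:

```python
import importlib.metadata as md

# packages_distributions() returns {top-level import name: [dist names]};
# guard for interpreters/releases where it doesn't exist yet.
pd = getattr(md, "packages_distributions", None)
if pd is not None:
    mapping = pd()
    for pkg, dists in sorted(mapping.items()):
        print(pkg, "->", dists)
```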

jaraco added a commit that referenced this issue Feb 24, 2021
jaraco added a commit that referenced this issue Feb 24, 2021
This was referenced Mar 15, 2021
yuvipanda added a commit to yuvipanda/python-popularity-contest that referenced this issue Jul 10, 2021
To answer the questions we have, we need data on libraries
installed in the environment, not packages that are
imported. importlib_metadata gives us access to the
RECORDS file (https://www.python.org/dev/peps/pep-0376/#record)
for every package, and we build a reverse mapping of
package name -> distribution once. Distribution
(I am calling them libraries) names are then used.

Since distributions are 'installed' specifically, they
already ignore modules in the standard library and any
local user written modules. The dependency on the stdlib_list
library can be removed.

All metric names have been changed to talk about libraries,
not packages.

The word 'package' is so overloaded, and nobody knows anything about
'distributions'. https://packaging.python.org/glossary/ is somewhat
helpful. I will now try to use just modules (something that can be
imported - since our source is sys.modules) and "library" (what is
installed with pip or conda - aka a distribution).

Despite what python/importlib_metadata#131
says, the package_distributions function in importlib_metadata relies
on the undocumented `top_level.txt` file, and does not work with
anything not using setuptools. So we go through all the
RECORDs ourselves.

Added some unit tests, and refactored some functions to make them
easier to test. Import-time side effects are definitely harder to
test, so I now require an explicit setup function call. This makes
testing much easier, and is also more intuitive.

Bump version number