Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always copy metadata, on a best-effort basis #5814

Closed
richardsheridan opened this issue May 8, 2021 · 7 comments · Fixed by #5830
Closed

Always copy metadata, on a best-effort basis #5814

richardsheridan opened this issue May 8, 2021 · 7 comments · Fixed by #5830
Labels
feature Feature request kind:to be discussed Please add our opionon and rational

Comments

@richardsheridan
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Many applications packaged by Pyinstaller will include packages from Pypi, which will tend to include metadata according to PEP 566 / core metadata specification (v2.1). So this metadata is often (not universally) available, but never included by default by Pyinstaller.

Describe the solution you'd like
Include the result of copy_metadata by default as if it had been run within a hook as

try:
    datas += copy_metadata("package")
except RuntimeError:
    pass 

Describe alternatives you've considered
The current solution would be to specify a hook for every package, or use --copy-metadata repeatedly, or specify the metadata you care about manually in datas.

Additional context
This should be changed because it causes the following issues:

  1. __version__ = importlib.metadata.version("package") is a fairly common pattern that is suggested in the setuptools_scm readme, leading to a continual stream of requests for trivial hooks that simply call copy_metadata: PackageNotFoundError: importlib_metadata with Pluggy 0.12.0 #4258, Pyinstaller Broken Module Import with Infinisdk pyinstaller-hooks-contrib#509.
  2. License information can often be found in the metadata and will help Pyinstaller users to stay in compliance with their dependencies' licenses: Pyinstaller prunes license text of included libraries and packages #5666 . This info is to become more standardized in PEP639.
  3. It breaks the pip-packager's assumptions about what metadata will be available at runtime.

This would be sort of a first step for #4881, I suppose. However, recent work in #5774 should relieve concerns about the known problems. I think this feature request could be included/implemented without having to fully resolve #4881.

@richardsheridan richardsheridan added the feature Feature request label May 8, 2021
@bwoodsend
Copy link
Member

Honestly I have mixed opinions on this one.

  1. Code like __version__ = importlib.metadata.version("package") is functionless. Either you need the version, in which case you should use importlib.metadata.version("package") directly, or you don't, in which case this is a wasted operation and a needless extra dependency on importlib-metadata for older Pythons.
  2. AFAIK the license (par its name) isn't practically retrievable at runtime anyway so lumping all the metadata isn't enough to deal with the licenses issue. Yes, they're collected but they buried deep inside onedir builds and are virtually invisible in onefile builds. The licenses all say that they must be promenantly visible, either with a command line switch or an about button on a UI which would require programmatic access.

With those in mind, the only real usage for a package's metadata that I see is entry points. These can now be collected easily via collect_entry_point("entrypoint name") so I see little benefit in including metadata by default. In fact, even if we did collect all metadata, you'd still need to --hiddenimport every plugin package you wanted to include.

Part of the problem is that it isn't copy_metadata("package"). It's really copy_metadata("distribution"). So PyInstaller sees import PIL and is expected to know to use copy_metadata("pillow") instead of copy_metadata("PIL"). You can work out which distributions are used by looking at every file in every distribution and checking if it has any files in common with what PyInstaller has collected. It'll be slow on large environments but then PyInstaller is next to useless on such environments anyway.

FWIW: I've just started working on an automatic metadata feature similar to what we use for ctypes.CDLL. Namely to scan collected bytecode for references to pkg_resources.get_distribution("...") (and its many alternatives) then copy_metadata() the "..." which would solve the problem more efficiently (assuming it works).

That all said, par the loss of speed, I don't think there's really anything lost in always copying metadata.

@bwoodsend bwoodsend added the kind:to be discussed Please add our opionon and rational label May 8, 2021
@bwoodsend
Copy link
Member

@rokm, @Legorooj, @BoboTiG if you're there: Thoughts on this one?

@richardsheridan
Copy link
Contributor Author

richardsheridan commented May 8, 2021

  1. Code like __version__ = importlib.metadata.version("package") is functionless. Either you need the version, in which case you should use importlib.metadata.version("package") directly, or you don't, in which case this is a wasted operation and a needless extra dependency on importlib-metadata for older Pythons.

I totally agree with this, having just ripped such code out of my project... Because it was breaking pyinstaller, leading to me submitting this issue.

That all said, par the loss of speed, I don't think there's really anything lost in always copying metadata.

I guess increased expectations from users on the availability of metadata could be called a loss.

@rokm
Copy link
Member

rokm commented May 9, 2021

I suppose I wouldn't be opposed to automatic collection of all metadata belonging to collected packages (or rather, their distribution), provided we can reliably map those packages to their distributions.

It would certainly remove the need for copy_metadata in hooks, and I think this would be especially beneficial for cases where package_a uses metadata to query version of its dependencies at run time. In those cases, it becomes unclear which hook should actually be collecting the metadata for those dependencies; should they be all collected by hook for package_a (which is the consumer), or should n separate hooks be created, one for each of those packages (which, for the sake of argument, do not require that metadata on their own)?

The metadata collection probably wouldn't do much for the license compliance (especially in onefile builds), but I guess the same infrastructure for obtaining list of used packages/distributions could be reused in the license collector.

As for the downsides; I suppose the added size of the collected metadata shouldn't be that much of an issue, but all those extra directories probably won't look very pretty in onedir builds.

@rokm
Copy link
Member

rokm commented May 9, 2021

Namely to scan collected bytecode for references to pkg_resources.get_distribution("...") (and its many alternatives)

Would that handle the case when the distribution name is passed via variable, e.g., inside a loop?

@Legorooj
Copy link
Member

Legorooj commented May 9, 2021

I think we definitely need to up the metadata included. I think @bwoodsend's code scanning solution is probably a good one, at least to start with. I'd be hesitant to do an automatic full-metadata collection for every package until we figure out some reliable + efficient way to do it.

@bwoodsend
Copy link
Member

No chance. But it'll do __version__ = importlib.metadata.version("package") and importlib.metadata.requires("foo>=x.y.x") which would be most of them.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature Feature request kind:to be discussed Please add our opionon and rational
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants