
What metadata/installability checks should Warehouse check uploads for? #194

Open
msabramo opened this issue Feb 6, 2014 · 30 comments

Labels: feature request · needs discussion (a product management/policy issue maintainers and users should discuss) · testing (test infrastructure and individual tests)

Comments

@msabramo
Contributor

msabramo commented Feb 6, 2014

Recently, a project that I worked on pushed a new version to PyPI and it turned out to be completely broken, because the setup.py was referencing a README.rst file that was not present in the sdist.

It would be awesome if PyPI could do some checking of packages that are uploaded.

To start with, it could create a virtualenv and try pip-installing the package, making sure the install exits with status 0. Perhaps it could also try easy_install to make sure that works too.

There are more elaborate things that could be done, like running tests if they are included (many packages don't bundle their tests, though) or validating the RST, but I think just pip-installing is a very good first step, as it detects packages that are completely broken.
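The install smoke test described above could be sketched roughly as below. `smoke_test_install` is a hypothetical helper, not part of Warehouse, and a real deployment would need to run it in a proper sandbox, since installation executes setup.py:

```python
import os
import subprocess
import sys
import tempfile

def smoke_test_install(dist_path):
    """Create a throwaway virtualenv and try to pip-install the given
    sdist or wheel; a nonzero exit status means the package is broken."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = os.path.join(tmp, "venv")
        # Build an isolated environment (bootstraps pip via ensurepip).
        subprocess.run([sys.executable, "-m", "venv", env_dir], check=True)
        pip = os.path.join(env_dir, "bin", "pip")  # Scripts\pip.exe on Windows
        result = subprocess.run([pip, "install", dist_path])
        return result.returncode == 0
```

Running tests or validating RST could then be layered on top of the same throwaway environment.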

@sontek

sontek commented Feb 6, 2014

What I'd love is to see the packaging/build tooling help prevent this before it ever gets up to pypi/warehouse, although it might be nice to do some sanity checking of its own as well.

It's a weird issue, since the usual cause is setup.py reading files at build time, but it would be nice if this were somehow detected so the tooling could alert you that you are referencing files that won't end up in the dist.

@agronholm

I don't think executing untrusted code on the PyPI server is an option. Can this be done without?

@r1chardj0n3s

This is partly what the cheesecake project attempted to do; I like the idea in theory, but it requires a bunch of work and support to make happen in practice.

@msabramo
Contributor Author

I guess tox would've caught this. Not everyone uses tox, though, and even people who do sometimes forget. Also, tox could pass inadvertently because of a file in the dev's workspace that isn't pushed to the canonical git repo when using setuptools_git.

I note that CPAN does automated testing of packages - I have no idea how they sandbox. VMs would be my guess. Obviously Travis CI and drone.io sandbox code to run tests. A system with VMs or Docker containers could help.

That said I recognize that this requires non-trivial resources so it's not easy.

@msabramo
Contributor Author

How awesome would it be to search PyPI and be able to sort by test coverage?!

What if setup.py had a field for ci_server and you could plug in Travis, Circle, Drone.io, etc. and they could run on your release tarball rather than your VCS repo?

@agronholm

What would be even more awesome is if we ran pyroma against uploads and users could filter out projects with low scores? I would personally find that VERY helpful. I would even vote for rejecting uploads that get lower than a minimum score. IIRC pyroma doesn't even require sandboxing.

@msabramo
Contributor Author

Example of how CPAN runs tests on packages:

http://ppm4.activestate.com/i686-linux/5.16/1600/M/MS/MSABRAMO/App-TarColor-0.011.d/log-20120801T133902.txt

We can't let Perl show us up! 😄

ActiveState is active in Python as well as Perl -- maybe they would be interested in hosting a service like this, as they do for the Perl community?

@dstufft
Member

dstufft commented Apr 23, 2014

@agronholm OK, so looking at pyroma, it has the following checks:

  1. The package should have a name, a version and a description. If it does not, it will receive a rating of 0.
  2. The version number should be a string. A floating point number will work with distutils, but most other tools will fail.
  3. The version number should comply with PEP 386.
  4. The long_description should be over 100 characters.
  5. Pyroma will convert your long_description to HTML using Docutils, to verify that it is possible. This guarantees pretty formatting of your description on PyPI. As long as Docutils can convert it, this passes, even if there are warnings or errors in the conversion. These warnings and errors are printed to stdout so you will see them.
  6. You should have the following metadata fields filled in: classifiers, keywords, author, author_email, url and license.
  7. You should have classifiers specifying the supported Python versions.
  8. If you are using setuptools or distribute you should specify zip_safe, as it defaults to "true" and that's probably not what you want.
  9. If you are using setuptools or distribute you can specify a test_suite to run tests with 'setup.py test'. This makes it easy to run tests for both humans and automated tools.
  10. If you are checking a PyPI package, and not a local directory or local package, pyroma will check the number of owners the package has on PyPI. It should be three or more, to minimize the "bus factor": the risk of the package owners suddenly going offline for whatever reason.
  11. If you are checking a PyPI package, and not a local directory or local package, pyroma will look for documentation for your package at pythonhosted.org and readthedocs.org. If it can't find it, it prints out a message to that effect. However, since you may have documentation elsewhere, this does not affect your rating.

So going through these checks!

  1. Yes, Yes, Description or Summary? Either way, maybe.
  2. Yes, but I don't think PyPI can actually look at this without executing the setup.py
  3. Yes, once PEP 440 is finalized I want to enforce this.
  4. Maybe
  5. Pyroma probably does it wrong, because just because it works in docutils doesn't mean PyPI will render it. But I agree with the check in spirit.
  6. • Classifiers -> Yes, but specific ones
    • Keywords -> No, they are kinda dumb
    • Author -> Sure
    • Author Email -> No
    • URL -> No, sometimes the PyPI page is the URL
    • License -> I think this is better handled by classifiers
  7. Yes-ish. In reality a dedicated metadata field would be better for this, but it's what we have ATM
  8. We can't determine this from PyPI w/o executing
  9. Useful, but I'm not sure we want to be hardcoding in setup.py stuff like this right now
  10. Ehh, I'm not real excited about this one. 3 is an arbitrary number, and giving people a negative mark because they didn't convince two other people to maintain their library isn't very nice.
  11. Documentation is a big one I'd like to do as well. Docs on pythonhosted.org we can detect super easily; RTD is harder. Crate did this and had to maintain a table of mappings because it had to guess at what the RTD link would be. This will be easier if we key it off a specially named project-url, but setuptools doesn't submit project URLs ATM, so it's hard to give people a negative mark for something they can't actually give us yet if they aren't using pythonhosted.org.
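A version-format check like the one in point 3 needs no sandboxing, since it only inspects a metadata string. The sketch below uses a simplified subset of the PEP 440 grammar; the canonical regex in the PEP (and the `packaging.version` module that real tools use) also covers epochs and local version labels:

```python
import re

# Simplified subset of the PEP 440 version grammar: a release segment plus
# optional pre-, post- and dev-release parts. Epochs ("1!2.0") and local
# versions ("1.0+abc") are omitted for brevity.
_VERSION_RE = re.compile(
    r"^\d+(\.\d+)*"       # release segment, e.g. 1.0.3
    r"((a|b|rc)\d+)?"     # optional pre-release, e.g. 2.0rc1
    r"(\.post\d+)?"       # optional post-release
    r"(\.dev\d+)?$"       # optional dev-release
)

def is_pep440_version(version):
    """Return True if the string looks like a (simplified) PEP 440 version."""
    return bool(_VERSION_RE.match(version))
```

An upload check could run this against the Version field and reject noncompliant uploads with a clear error message.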

@agronholm

What I propose is:

  1. Enforcing that all the minimum files for installation (setup.py and egg-info/dist-info/whatever) are in place
  2. Enforcing that name, description, long_description and license are filled out
  3. Enforcing that the version conforms to the relevant PEPs (for new projects at least)
  4. Enforcing that the supported range of Python versions is somehow indicated in the metadata
  5. Filtering out projects in search that don't have downloadable distributions available

@dstufft
Member

dstufft commented Apr 23, 2014

  1. Yes! PyPI legacy "tries" to do this now, but I don't think it does a very good job of it.
  2. Technically the License field isn't supposed to be filled out if a license classifier fits, so as long as we have that logic in place, then OK by me.
  3. Need to figure out what sort of coverage the related PEPs have over the versions currently on PyPI, but I agree.
  4. Probably OK
  5. Yes!

@nlhkabu nlhkabu added the requires triaging maintainers need to do initial inspection of issue label Jul 2, 2016
@brainwane brainwane added this to the Someday/maybe milestone Dec 7, 2017
@brainwane
Contributor

This would be a fantastic feature. I'm only moving this to a future milestone because it is not on the critical path for the immediate goal of switching from the old PyPI to Warehouse.

@jayfk
Contributor

jayfk commented Jan 25, 2018

Is this something people are still interested in?

If we had a system in place that allows us to fully install a package, there's lots of other useful information we could extract from it: transitive dependencies, metadata, even tests.

@agronholm

Yes, people are still interested in this but it's been deferred until the immediate goals have been met and Warehouse has replaced Cheeseshop as the official PyPI.

@jayfk
Contributor

jayfk commented Jan 25, 2018

Best thing is probably to make this a sandboxed standalone project with some kind of API that warehouse can call.

I’m currently working on a POC for this. If anyone is interested, please let me know.

@brainwane
Contributor

@jayfk I would love for you to post to pypa-dev to give a heads-up that you're doing this, so people can speak up there if they're interested in helping.

@brainwane
Contributor

@jayfk could you post to pypa-dev or the distutils-sig list about your proof of concept? Thanks!

@brainwane
Contributor

Jannis posted to distutils-sig -- thanks @jayfk!

@jayfk
Contributor

jayfk commented Feb 12, 2018

Yep! Thanks for pointing that out @brainwane!

@brainwane brainwane added feature request testing Test infrastructure and individual tests and removed requires triaging maintainers need to do initial inspection of issue labels Feb 12, 2018
@zenogantner

Actually, Perl has CPANTesters, which lets the community help to do the testing over a lot of different platforms and version combinations.
And they have had it for more than a decade.

This is definitely something where Python can learn from Perl.

@brainwane brainwane changed the title Package checking Automatically check uploaded packages (metadata, installability) Jun 20, 2019
@brainwane
Contributor

This is an issue that came up in a discussion of improving pip's automated tests last week; pip would find this feature helpful.

@brainwane
Contributor

Reminder to folks following this: as @ewdurbin and @pradyunsg and I talked about in a meeting about the pip resolver work, getting this implemented might help us smooth testing of and the effects of the resolver rollout. So if any volunteers could help get this finished and merged, we'd appreciate that! cc @yeraydiazdiaz

@brainwane brainwane added the needs discussion a product management/policy issue maintainers and users should discuss label Apr 3, 2020
@brainwane brainwane changed the title Automatically check uploaded packages (metadata, installability) What metadata/installability checks should Warehouse check uploads for? Apr 3, 2020
@brainwane
Contributor

@uranusjr @pfmoore @pradyunsg I'd appreciate if you could review this issue, especially #194 (comment) and the comments below it, which discuss some upload checks we could implement in Warehouse (blocking noncompliant uploads). Which of those checks would be particularly valuable to the pip resolver work? Once we have that list, we can make a checklist so this issue becomes more completable.

@uranusjr
Contributor

uranusjr commented Apr 4, 2020

One wild wish I’ve always hoped for is to eliminate dynamic dependencies altogether, and ensure all dists of a given version specify exactly the same set of dependencies. One of the most resource-consuming parts of dependency resolution (both for development and at runtime) is the need to download a package matching the host environment, extract it, and potentially build it, just to get its dependencies. This can cost a tremendous amount of time if you’re on a platform without wheels (e.g. musl), and it would be a vast improvement if we were able to download (say) a wheel for Windows and know its metadata would match what we'd get building the same version on a random Linux machine.

@pfmoore
Contributor

pfmoore commented Apr 4, 2020

It would be great to have a check that the name and version in the metadata match the filename (sdist or wheel), so that we don't have to abort on mismatches. (We'd still have to check, for local files and other edge cases, but knowing that the filename is reliable would still be useful).

As @uranusjr mentioned, the only really useful checks from the pip resolver POV are ones which would allow us to minimise downloading of distributions. All we use is name, version and dependencies.

So for us:

  1. As I said, knowing that the filename gives us the correct name/version is massive. Sufficiently so that we make that assumption already and crash out if it turns out to be wrong.
  2. Short of Warehouse growing a "give me the metadata" API, we have to download to get dependencies, so the only savings here for us would be across files (i.e., checks that all sdists and wheels for package FOO version X.Y.Z return the same dependency metadata). If we had that, we could cache dependencies for name/version, or extract them from a wheel rather than the corresponding sdist, saving a build step. But we'd still need to special-case PyPI to enable that, so the cost in complexity may mean it's not worth it.
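The filename consistency check from point 1 can be done without executing anything, because PEP 427 fixes the wheel filename layout. A minimal sketch, assuming PEP 503-style name normalisation (for production use, `packaging.utils.parse_wheel_filename` does this parsing):

```python
import re

def _normalize(name):
    # PEP 503 normalisation: case-insensitive, with runs of -, _ and .
    # collapsed to a single hyphen.
    return re.sub(r"[-_.]+", "-", name).lower()

def wheel_filename_matches(filename, meta_name, meta_version):
    """Check that a wheel filename's name/version segments agree with the
    distribution's metadata. Wheel filenames follow PEP 427:
    {name}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    """
    if not filename.endswith(".whl"):
        return False
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 segments when a build tag is present
        return False
    name, version = parts[0], parts[1]
    return _normalize(name) == _normalize(meta_name) and version == meta_version
```

The same idea applies to sdists, though their `{name}-{version}.tar.gz` layout was historically less strictly specified.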

@dstufft
Member

dstufft commented Apr 4, 2020

As I said, knowing that the filename gives us the correct name/version is massive. Sufficiently so that we make that assumption already and crash out if it turns out to be wrong.

I'd argue this is already something you should be able to assume. If those two things don't line up, I'd just call it an error. We've never made any promises that it would work, and if it does currently work I'd call it an implementation detail that just happened to allow it for a while.

Short of Warehouse growing a "give me the metadata" API

I'm fairly sure that at some point we will supplant the simple API with a better one that will include the needed information, but that is obviously not yet there. That being said, it's probably useful to start checking/enforcing any useful checks now to get better data in the long run.

@dstufft
Member

dstufft commented Apr 4, 2020

eliminate dynamic dependency altogether, and ensure all dists of a given version specify exactly the same set of dependencies.

I'm not sure that this is possible, or even desirable. At least when we talked about PEP 517, @njsmith argued pretty strongly that we need some sort of hook that enables programmatic dependencies, and that we simply could not depend on static-only metadata for sdists.

@brainwane
Contributor

At the summit last year @crwilcox asked:

can we fail when there's missing author, author_email, URL. Currently warnings on setup.

@dstufft said, above (in 2014):

  • Author -> Sure
  • Author Email -> No
  • URL -> No, sometimes the PyPI Page is the URL

@dstufft do you agree with your 2014 self that author email and project URL should not be mandatory?

@brainwane
Contributor

@dstufft @pfmoore @pradyunsg @ewdurbin @di @uranusjr @techalchemy OK, I reread the above discussion. It sounds like the metadata/installability checks that make sense for Warehouse to mandate are:

  • Package name
    • must exist
    • must match filename (sdist or wheel)
  • Version
    • must exist
    • must be PEP 440-compliant
    • must match filename (sdist or wheel)
  • Description: must exist
  • Author name: must exist
  • Classifiers for supported Python versions (potentially supplanted by #3889, "Reject packages without a Requires-Python")
  • Certain additional classifiers (which ones?)
    • License

Does this sound right? Are any of these already mandated?
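For concreteness, the checklist above could be expressed as a single validation pass over an upload's metadata. The field names and rule set below are illustrative, not Warehouse's actual schema:

```python
# Fields the checklist above says must exist.
REQUIRED_FIELDS = ("name", "version", "description", "author")

def validate_upload_metadata(meta):
    """Return a list of human-readable problems with an upload's metadata;
    an empty list means the upload passes. `meta` is a plain dict of
    core metadata fields (a simplifying assumption for this sketch)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not meta.get(field):
            problems.append("missing required field: %s" % field)
    # Supported-Python indication: trove classifiers or Requires-Python.
    has_python_classifier = any(
        c.startswith("Programming Language :: Python")
        for c in meta.get("classifiers", ())
    )
    if not (has_python_classifier or meta.get("requires_python")):
        problems.append("no supported-Python classifiers or Requires-Python")
    return problems
```

The name/version-vs-filename and PEP 440 checks would slot into the same pass once the filename parsing discussed earlier is available.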

@pfmoore
Contributor

pfmoore commented Apr 8, 2020

Sounds reasonable to me.

I'd argue that there should be some means of contacting the author, but I concede that "issue tracker on the project webpage" is entirely reasonable for that. So we move out of the area of what should be checked, and into a new metadata field.

@di
Member

di commented Apr 8, 2020
