
What metadata/installability checks should Warehouse check uploads for? #194

Open
msabramo opened this issue Feb 6, 2014 · 30 comments

Labels: feature request · needs discussion (a product management/policy issue maintainers and users should discuss) · testing (test infrastructure and individual tests)

Comments

@msabramo
Contributor

msabramo commented Feb 6, 2014

Recently, a project that I worked on pushed a new version to PyPI and it turned out to be completely broken, because the setup.py was referencing a README.rst file that was not present in the sdist.

It would be awesome if PyPI could do some checking of packages that are uploaded.

To start with, it could create a virtualenv and try pip-installing the package, making sure the install exits with status 0. Perhaps it could also try easy_install to make sure that works too.

There are more elaborate things that could be done, like running tests if they are included (many packages don't bundle their tests, though) or validating the RST, but I think just pip-installing is a very good first step, as it detects packages that are completely broken.
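The install smoke test described above could be sketched roughly as below. `smoke_test_install` is a hypothetical helper, not part of Warehouse, and a real deployment would need to run it in a proper sandbox, since installation executes setup.py:

```python
import os
import subprocess
import sys
import tempfile

def smoke_test_install(dist_path):
    """Create a throwaway virtualenv and try to pip-install the given
    sdist or wheel; a nonzero exit status means the package is broken."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = os.path.join(tmp, "venv")
        # Build an isolated environment (bootstraps pip via ensurepip).
        subprocess.run([sys.executable, "-m", "venv", env_dir], check=True)
        pip = os.path.join(env_dir, "bin", "pip")  # Scripts\pip.exe on Windows
        result = subprocess.run([pip, "install", dist_path])
        return result.returncode == 0
```

Running tests or validating RST could then be layered on top of the same throwaway environment.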

@sontek

sontek commented Feb 6, 2014

What I'd love is to see the packaging/build tooling help prevent this before it ever gets up to pypi/warehouse, although it might be nice to do some sanity checking of its own as well.

It's a weird issue, since the usual cause is setup.py reading files at build time, but it would be nice if this were somehow detected so the tooling could alert you that you are referencing files that won't end up in the dist.

@agronholm

I don't think executing untrusted code on the PyPI server is an option. Can this be done without?

@r1chardj0n3s

This is partly what the cheesecake project attempted to do; I like the idea in theory, but it requires a bunch of work and support to make happen in practice.

@msabramo
Contributor Author

I guess tox would've caught this. Not everyone uses tox, though, and even people who do sometimes forget. Also, tox could pass inadvertently because of a file in the dev's workspace that isn't pushed to the canonical git repo when using setuptools_git.

I note that CPAN does automated testing of packages - I have no idea how they sandbox. VMs would be my guess. Obviously Travis CI and drone.io sandbox code to run tests. A system with VMs or Docker containers could help.

That said I recognize that this requires non-trivial resources so it's not easy.

@msabramo
Contributor Author

How awesome would it be to search PyPI and be able to sort by test coverage?!

What if setup.py had a field for ci_server and you could plug in Travis, Circle, Drone.io, etc. and they could run on your release tarball rather than your VCS repo?

@agronholm

What would be even more awesome is if we ran pyroma against uploads and users could filter out projects with low scores? I would personally find that VERY helpful. I would even vote for rejecting uploads that get lower than a minimum score. IIRC pyroma doesn't even require sandboxing.

@msabramo
Contributor Author

Example of how CPAN runs tests on packages:

http://ppm4.activestate.com/i686-linux/5.16/1600/M/MS/MSABRAMO/App-TarColor-0.011.d/log-20120801T133902.txt

We can't let Perl show us up! 😄

ActiveState is active in Python as well as Perl -- maybe they would be interested in hosting a service like this, as they do for the Perl community?

@dstufft
Member

dstufft commented Apr 23, 2014

@agronholm OK, so looking at pyroma, it has the following checks:

  1. The package should have a name, a version and a description. If it does not, it will receive a rating of 0.
  2. The version number should be a string. A floating point number will work with distutils, but most other tools will fail.
  3. The version number should comply with PEP 386.
  4. The long_description should be over 100 characters.
  5. Pyroma will convert your long_description to HTML using Docutils, to verify that it is possible. This guarantees pretty formatting of your description on PyPI. As long as Docutils can convert it, this passes, even if there are warnings or errors in the conversion. These warnings and errors are printed to stdout so you will see them.
  6. You should have the following metadata fields filled in: classifiers, keywords, author, author_email, url and license.
  7. You should have classifiers specifying the supported Python versions.
  8. If you are using setuptools or distribute you should specify zip_safe, as it defaults to "true" and that's probably not what you want.
  9. If you are using setuptools or distribute you can specify a test_suite to run tests with 'setup.py test'. This makes it easy to run tests for both humans and automated tools.
  10. If you are checking a PyPI package, and not a local directory or local package, pyroma will check the number of owners the package has on PyPI. It should be three or more, to minimize the "bus factor": the risk of the package owners suddenly going offline for whatever reason.
  11. If you are checking a PyPI package, and not a local directory or local package, pyroma will look for documentation for your package at pythonhosted.org and readthedocs.org. If it can't find it, it prints out a message to that effect. However, since you may have documentation elsewhere, this does not affect your rating.

So going through these checks!

  1. Yes, Yes, Description or Summary? Either way, maybe.
  2. Yes, but I don't think PyPI can actually look at this without executing the setup.py
  3. Yes, once PEP 440 is finalized I want to enforce this.
  4. Maybe
  5. Pyroma probably does it wrong, because just because it works in docutils doesn't mean PyPI will render it. But I agree with the check in spirit.
  6. • Classifiers -> Yes, but specific ones
    • Keywords -> No, they are kinda dumb
    • Author -> Sure
    • Author Email -> No
    • URL -> No, sometimes the PyPI page is the URL
    • License -> I think this is better handled by classifiers
  7. Yes-ish. In reality a dedicated metadata field would be better for this, but it's what we have ATM
  8. We can't determine this from PyPI w/o executing
  9. Useful, but I'm not sure we want to be hardcoding in setup.py stuff like this right now
  10. Ehh, I'm not real excited about this one. 3 is an arbitrary number, and giving people a negative mark because they didn't convince two other people to maintain their library isn't very nice.
  11. Documentation is a big one I'd like to do as well. Docs on pythonhosted.org we can detect super easily; RTD is harder. Crate did this and had to maintain a table of mappings because it had to guess at what the RTD link would be. This will be easier if we key it off a specially named project-url, but setuptools doesn't submit project URLs ATM, so it's hard to give people a negative mark for something they can't actually give us yet if they aren't using pythonhosted.org.
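A version-format check like the one in point 3 needs no sandboxing, since it only inspects a metadata string. The sketch below uses a simplified subset of the PEP 440 grammar; the canonical regex in the PEP (and the `packaging.version` module that real tools use) also covers epochs and local version labels:

```python
import re

# Simplified subset of the PEP 440 version grammar: a release segment plus
# optional pre-, post- and dev-release parts. Epochs ("1!2.0") and local
# versions ("1.0+abc") are omitted for brevity.
_VERSION_RE = re.compile(
    r"^\d+(\.\d+)*"       # release segment, e.g. 1.0.3
    r"((a|b|rc)\d+)?"     # optional pre-release, e.g. 2.0rc1
    r"(\.post\d+)?"       # optional post-release
    r"(\.dev\d+)?$"       # optional dev-release
)

def is_pep440_version(version):
    """Return True if the string looks like a (simplified) PEP 440 version."""
    return bool(_VERSION_RE.match(version))
```

An upload check could run this against the Version field and reject noncompliant uploads with a clear error message.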

@agronholm

What I propose is:

  1. Enforcing that all the minimum files for installation (setup.py and egg-info/dist-info/whatever) are in place
  2. Enforcing that name, description, long_description and license are filled out
  3. Enforcing that the version conforms to the relevant PEPs (for new projects at least)
  4. Enforcing that the supported range of Python versions is somehow indicated in the metadata
  5. Filtering out projects in search that don't have downloadable distributions available

@dstufft
Member

dstufft commented Apr 23, 2014

  1. Yes! PyPI legacy "tries" to do this now, but I don't think it does a very good job of it.
  2. Technically the License field isn't supposed to be filled out if a license classifier fits, so as long as we have that logic in place, then OK by me.
  3. Need to figure out what sort of coverage the related PEPs have over the versions currently on PyPI, but I agree.
  4. Probably OK
  5. Yes!

@nlhkabu nlhkabu added the requires triaging maintainers need to do initial inspection of issue label Jul 2, 2016
@brainwane brainwane added this to the Someday/maybe milestone Dec 7, 2017
@brainwane
Contributor

This would be a fantastic feature. I'm only moving this to a future milestone because it is not on the critical path for the immediate goal of switching from the old PyPI to Warehouse.

@jayfk
Contributor

jayfk commented Jan 25, 2018

Is this something people are still interested in?

If we had a system in place that allows us to fully install a package, there's lots of other useful information we could extract from it: transitive dependencies, metadata, even tests.

@agronholm

Yes, people are still interested in this but it's been deferred until the immediate goals have been met and Warehouse has replaced Cheeseshop as the official PyPI.

@jayfk
Contributor

jayfk commented Jan 25, 2018

Best thing is probably to make this a sandboxed standalone project with some kind of API that warehouse can call.

I’m currently working on a POC for this. If anyone is interested, please let me know.

@brainwane
Contributor

@jayfk I would love for you to post to pypa-dev to give a heads-up that you're doing this, so people can speak up there if they're interested in helping.

@brainwane
Contributor

@jayfk could you post to pypa-dev or the distutils-sig list about your proof of concept? Thanks!

@brainwane
Contributor

Jannis posted to distutils-sig -- thanks @jayfk!

@jayfk
Contributor

jayfk commented Feb 12, 2018

Yep! Thanks for pointing that out @brainwane!

@brainwane brainwane added feature request testing Test infrastructure and individual tests and removed requires triaging maintainers need to do initial inspection of issue labels Feb 12, 2018
@zenogantner

Actually, Perl has CPANTesters, which lets the community help to do the testing over a lot of different platforms and version combinations.
And they have had it for more than a decade.

This is definitely something where Python can learn from Perl.

@brainwane brainwane changed the title Package checking Automatically check uploaded packages (metadata, installability) Jun 20, 2019
@brainwane
Contributor

This is an issue that came up in a discussion of improving pip's automated tests last week; pip would find this feature helpful.

@brainwane
Contributor

Reminder to folks following this: as @ewdurbin and @pradyunsg and I talked about in a meeting about the pip resolver work, getting this implemented might help us smooth testing of and the effects of the resolver rollout. So if any volunteers could help get this finished and merged, we'd appreciate that! cc @yeraydiazdiaz

@brainwane brainwane added the needs discussion a product management/policy issue maintainers and users should discuss label Apr 3, 2020
@brainwane brainwane changed the title Automatically check uploaded packages (metadata, installability) What metadata/installability checks should Warehouse check uploads for? Apr 3, 2020
@brainwane
Contributor

@uranusjr @pfmoore @pradyunsg I'd appreciate if you could review this issue, especially #194 (comment) and the comments below it, which discuss some upload checks we could implement in Warehouse (blocking noncompliant uploads). Which of those checks would be particularly valuable to the pip resolver work? Once we have that list, we can make a checklist so this issue becomes more completable.

@uranusjr
Contributor

uranusjr commented Apr 4, 2020

One wild wish I’ve always hoped for is to eliminate dynamic dependencies altogether, and ensure all dists of a given version specify exactly the same set of dependencies. One of the most resource-consuming parts of dependency resolution (both for development and at runtime) is the need to download a package matching the host environment, extract it, and potentially build it, just to get its dependencies. This can cost a tremendous amount of time if you’re on a platform without wheels (e.g. musl), and it would be a vast improvement if we were able to download (say) a wheel for Windows and know its metadata would match what we'd get building the same version on a random Linux machine.

@pfmoore
Contributor

pfmoore commented Apr 4, 2020

It would be great to have a check that the name and version in the metadata match the filename (sdist or wheel), so that we don't have to abort on mismatches. (We'd still have to check, for local files and other edge cases, but knowing that the filename is reliable would still be useful).

As @uranusjr mentioned, the only really useful checks from the pip resolver POV are ones which would allow us to minimise downloading of distributions. All we use is name, version and dependencies.

So for us:

  1. As I said, knowing that the filename gives us the correct name/version is massive. Sufficiently so that we make that assumption already and crash out if it turns out to be wrong.
  2. Short of Warehouse growing a "give me the metadata" API, we have to download to get dependencies, so the only savings here for us would be across files (i.e., checks that all sdists and wheels for package FOO version X.Y.Z return the same dependency metadata). If we had that, we could cache dependencies for name/version, or extract them from a wheel rather than the corresponding sdist, saving a build step. But we'd still need to special-case PyPI to enable that, so the cost in complexity may mean it's not worth it.
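The filename consistency check from point 1 can be done without executing anything, because PEP 427 fixes the wheel filename layout. A minimal sketch, assuming PEP 503-style name normalisation (for production use, `packaging.utils.parse_wheel_filename` does this parsing):

```python
import re

def _normalize(name):
    # PEP 503 normalisation: case-insensitive, with runs of -, _ and .
    # collapsed to a single hyphen.
    return re.sub(r"[-_.]+", "-", name).lower()

def wheel_filename_matches(filename, meta_name, meta_version):
    """Check that a wheel filename's name/version segments agree with the
    distribution's metadata. Wheel filenames follow PEP 427:
    {name}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    """
    if not filename.endswith(".whl"):
        return False
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 segments when a build tag is present
        return False
    name, version = parts[0], parts[1]
    return _normalize(name) == _normalize(meta_name) and version == meta_version
```

The same idea applies to sdists, though their `{name}-{version}.tar.gz` layout was historically less strictly specified.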

@dstufft
Member

dstufft commented Apr 4, 2020

As I said, knowing that the filename gives us the correct name/version is massive. Sufficiently so that we make that assumption already and crash out if it turns out to be wrong.

I'd argue this is already something you should be able to assume. If those two things don't line up, I'd just call it an error. We've never made any promises that it would work, and if it does currently work I'd call it an implementation detail that just happened to allow it for a while.

Short of Warehouse growing a "give me the metadata" API

I'm fairly sure that at some point we will supplant the simple API with a better one that will include the needed information, but that is obviously not yet there. That being said, it's probably useful to start checking/enforcing any useful checks now to get better data in the long run.

@dstufft
Member

dstufft commented Apr 4, 2020

eliminate dynamic dependency altogether, and ensure all dists of a given version specify exactly the same set of dependencies.

I'm not sure that this is possible, or even desirable. At least when we talked about PEP 517, @njsmith argued pretty strongly that we need some sort of hook that enables programmatic dependencies, and that we simply could not depend on static-only metadata for sdists.

@brainwane
Contributor

At the summit last year @crwilcox asked:

can we fail when there's missing author, author_email, URL. Currently warnings on setup.

@dstufft said, above (in 2014):

  • Author -> Sure
  • Author Email -> No
  • URL -> No, sometimes the PyPI Page is the URL

@dstufft do you agree with your 2014 self that author email and project URL should not be mandatory?

@brainwane
Contributor

@dstufft @pfmoore @pradyunsg @ewdurbin @di @uranusjr @techalchemy OK, I reread the above discussion. It sounds like the metadata/installability checks that make sense for Warehouse to mandate are:

  • Package name
    • must exist
    • must match filename (sdist or wheel)
  • Version
    • must exist
    • must be PEP 440-compliant
    • must match filename (sdist or wheel)
  • Description: must exist
  • Author name: must exist
  • Classifiers for supported Python versions (potentially supplanted by #3889, "Reject packages without a Requires-Python")
  • Certain additional classifiers (which ones?)
    • License

Does this sound right? Are any of these already mandated?
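For concreteness, the checklist above could be expressed as a single validation pass over an upload's metadata. The field names and rule set below are illustrative, not Warehouse's actual schema:

```python
# Fields the checklist above says must exist.
REQUIRED_FIELDS = ("name", "version", "description", "author")

def validate_upload_metadata(meta):
    """Return a list of human-readable problems with an upload's metadata;
    an empty list means the upload passes. `meta` is a plain dict of
    core metadata fields (a simplifying assumption for this sketch)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not meta.get(field):
            problems.append("missing required field: %s" % field)
    # Supported-Python indication: trove classifiers or Requires-Python.
    has_python_classifier = any(
        c.startswith("Programming Language :: Python")
        for c in meta.get("classifiers", ())
    )
    if not (has_python_classifier or meta.get("requires_python")):
        problems.append("no supported-Python classifiers or Requires-Python")
    return problems
```

The name/version-vs-filename and PEP 440 checks would slot into the same pass once the filename parsing discussed earlier is available.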

@pfmoore
Contributor

pfmoore commented Apr 8, 2020

Sounds reasonable to me.

I'd argue that there should be some means of contacting the author, but I concede that "issue tracker on the project webpage" is entirely reasonable for that. So we move out of the area of what should be checked, and into a new metadata field.

@di
Member

di commented Apr 8, 2020
