
Repeatable installs via hashing #3137

Closed · wants to merge 40 commits

Conversation

erikrose (Contributor)

Add checks against requirements-file-dwelling hashes for most kinds of packages. Close #1175.

This lets you add --hash=sha256:abcde… to a line in a requirements file to verify a package against a good local hash, guaranteeing repeatable installs without needing to run your own index mirrors, in the vein of https://pypi.python.org/pypi/peep/.

  • Add --require-hashes option. This is handy in deployment scripts to force application authors to hash their requirements. It is also a convenient way to get pip to show computed hashes for a virgin, unhashed requirements file. Eventually, additions to pip freeze should fill a superset of this use case.
    • In --require-hashes mode, at least one hash is required to match for each requirement, even subrequirements.
    • Option-based requirements (--hash=sha256:...) turn on --require-hashes mode implicitly.
    • Internet-derived URL-based hashes are "necessary but not sufficient": they do not satisfy --require-hashes mode when they match, but they are still used to guard against transmission errors.
    • Locally stored URL-based requirements (#md5=...) are treated just like option-based ones, except they don't turn on --require-hashes. Basically, that spelling is deprecated in favor of the option syntax.
    • --require-hashes disallows --egg, which would bypass its dependency control by delegating everything to setuptools.
  • Complain informatively, with the most devastating errors first so you don't chase your tail all day only to run up against a brick wall at the end. This also means we don't complain that a hash is missing, only for the user to find, after fixing it, that we have no idea how to even compute a hash for that type of requirement.
    • Complain about unpinned requirements when hash-checking mode is on, lest they cause the user surprise later.
    • Complain about missing hashes.
    • Complain about requirement types we don't know how to hash (like VCS ones and local dirs).
  • Have InstallRequirement keep its original Link around (original_link) so we can differentiate between URL hashes from requirements files and ones downloaded from the (untrustworthy) internet.
  • Remove test_download_hashes, which is obsolete. Similar coverage is provided in test_utils.TestHashes and the various hash cases in test_req.py.
  • Add a pip hash command for hashing package archives.
  • Add hash-checking support for pip download and pip wheel.
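The digest that `pip hash` prints and that a `--hash=sha256:...` requirements line pins down is an ordinary hex digest of the archive. As a simplified standalone sketch (the function name and chunk size are illustrative, not pip's actual code), the computation looks like this:

```python
import hashlib

def archive_hash(path, algorithm='sha256', chunk_size=8192):
    """Return a requirements-file hash option for a local archive.

    Reads the file in chunks so large sdists/wheels don't have to fit
    in memory, then formats the digest the way a requirements line
    expects it: --hash=<algorithm>:<hexdigest>.
    """
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return '--hash={}:{}'.format(algorithm, h.hexdigest())
```

A requirements line then reads `SomePackage==1.2 --hash=sha256:<digest>`, and as noted above, the presence of any `--hash` option flips pip into `--require-hashes` mode implicitly.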

For a later release:

We purposely keep it off the CLI for now. optparse isn't really geared to expose interspersed args and options, so a more heavy-handed approach will be necessary to support things like `pip install SomePackage --sha256=abcdef... OtherPackage --sha256=012345...`.
Everybody seems to favor this. Spelled -H, it's still pretty short. And it is less unusual programmatically.
@erikrose erikrose changed the title Hashing Repeatable installs via hashing Sep 25, 2015
@erikrose (Contributor, Author)

Hmm, those unicode encoding test failures don't occur locally. Curious.

[They do now!]

@dstufft (Member) commented Sep 25, 2015

Didn't review this, but I think it shouldn't imply --no-deps, it should just reject anything that doesn't have a hash. I think implying --no-deps is going to lead to confusion if people manually update a package in here without noticing that it pulled in new deps or something like that. With --no-deps, if we can't satisfy requirements there will be no error until possibly at runtime; that doesn't seem like the right semantics to me.

@dstufft (Member) commented Sep 25, 2015

Oh, in addition, this is obviously not going to protect against setup_requires, I think that pip could probably instruct setuptools to hard fail if there is setup_requires instead of silently not giving the implied security semantics of the --require-hashes option. Not sure if that's worth it though, since in the future we'll hopefully move away from the implicit setup_requires mechanism and move to something that pip itself has control over.

@dstufft (Member) commented Sep 25, 2015

Another thought: Would it make sense to do --hash sha256=abcd (maybe in addition to :) so people can copy/paste from PyPI? Maybe it doesn't make sense though, just throwing it out there.

@erikrose (Contributor, Author)

Well, conforming to GNU conventions, it'd be --hash=sha256=abcd, which is a bit weird-looking to me…

@dstufft (Member) commented Sep 25, 2015

Yea, that's a bit weird and it's probably not a major use case. I'm fine saying it's not worth it.

@@ -523,6 +524,47 @@ def only_binary():
    )


def _good_hashes():
    """Return names of hashlib algorithms at least as strong as sha256."""
    # Remove getattr when 2.6 dies.
Member:

I don't think 2.7 has this before like, 2.7.9 so probably we can't get rid of it until we get rid of 2.7.

Contributor Author:

https://docs.python.org/3/whatsnew/2.7.html suggests it was added in 2.7.0; it lists things added in 2.7.x separately.

Member:

Oh, I'm stupid I was confusing it with guaranteed_hashes.
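For context, a standalone sketch of what a `_good_hashes`-style helper does; the allowlist of "strong" algorithms here is an assumption for illustration, and the real implementation is in the diff above:

```python
import hashlib

# Algorithms considered at least as collision-resistant as SHA-256.
# Weaker ones (md5, sha1) can still guard downloads against transmission
# errors but don't satisfy --require-hashes mode. (Assumed allowlist.)
STRONG_HASHES = {'sha256', 'sha384', 'sha512'}

def good_hashes():
    """Return names of usable hashlib algorithms at least as strong as sha256."""
    # getattr() guards against ancient Pythons lacking algorithms_available,
    # mirroring the "Remove getattr when 2.6 dies" comment in the diff.
    available = getattr(hashlib, 'algorithms_available', set())
    return sorted(STRONG_HASHES & set(available))
```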

@dstufft (Member) commented Sep 25, 2015

A few comments, will give it a more in depth review once I wake up more.

dstufft is nervous about blowing a single-char option on something that will usually be copied and pasted anyway. We can always put it back later if it proves to be a pain.
For dependencies that are properly pinned and hashed (not really dependencies at all, if you like, since they're explicit, root-level requirements), we install them as normal. For ones that are not pinned and hashed, we raise the errors typical of any unhashed requirement in --require-hashes mode.

Since the stanza under "if not ignore_dependencies" doesn't actually add anything if it's already in the RequirementSet, not much has to be done in the way of code: the unhashed deps don't have any hashes, so we complain about them as per usual.

Also...
* Revise wording of HashUnpinned errors. They can be raised even if no hash is specified, so the previous wording was misleading.
* Make wording of HashMissing less awkward.
Previously, Hash Verification, Editable Installs, Controlling setup_requires, and Build System Interface were all getting placed under it.
Those commands already checked hashes, since they use RequirementSet, where the hash-checking is done.

Reorder some options so pre, no-clean, and require-hashes are always in the same order.
        For example, some-package==1.2 is pinned; some-package>1.2 is not.
        """
        specifiers = self.specifier
        return len(specifiers) == 1 and next(iter(specifiers)).operator == '=='
Member:

you can also pin with '==='

Contributor Author:

That's a new one to me. What's that mean?

Member:

It's a little bit odd here, because with PEP 440 == doesn't actually guarantee you'll always get the exact same thing. Local versions are not considered when checking == so if someone uploads 1.0+mylocalversion that will match ==1.0. However PyPI does not allow uploads of local versions, so it might be a moot point.

The === is a way to essentially escape PEP 440's strict parsing requirements for legacy versions that can't be successfully parsed as a PEP 440 version. It also provides a way to escape the == behavior around local versions.
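A standalone sketch of the pinning check extended with the `===` case just described; this is simplified (specifiers are modeled as plain (operator, version) pairs rather than pip's specifier objects):

```python
def is_pinned(specifiers):
    """True if exactly one specifier pins an exact version.

    '==' pins under PEP 440 (modulo the local-version caveat above);
    '===' is the arbitrary-equality escape hatch and also pins.
    """
    specs = list(specifiers)
    return len(specs) == 1 and specs[0][0] in ('==', '===')
```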

Hashes from PyPI
~~~~~~~~~~~~~~~~

PyPI provides an md5 hash in the fragment portion of each package download
Contributor:

This doesn't document the ability to check other hash algorithms, as the previous docs did.
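The fragment in question rides on the download URL, and extracting it is straightforward with the standard library (the URL and digest below are made up for illustration):

```python
from urllib.parse import urlparse

def hash_from_url(url):
    """Return (algorithm, hexdigest) from a '#md5=...'-style fragment, or None."""
    fragment = urlparse(url).fragment
    if '=' not in fragment:
        return None
    algorithm, _, digest = fragment.partition('=')
    return algorithm, digest
```

Nothing in this scheme restricts the algorithm name, which is why the docs could in principle cover `#sha256=`-style fragments as well.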

@qwcode (Contributor) commented Oct 21, 2015

In the meantime, the only cost is a --use-wheel-cache-with-hashes flag

to be clear, is your intention for this to mean "trust my cache", or "use my cache, but check its hashes"?

@erikrose (Contributor, Author)

"use my cache, but check its hashes"

IOW, "I have put hashes of the locally built wheels into my reqs file. If there is a locally built wheel, check it and use it; don't avoid it."

In light of the Horrible Truth, though, this option isn't such a good idea as I once thought, since those hashes turn out not to be very durable. I'll think about that some more.

@qwcode (Contributor) commented Oct 21, 2015

here's a user story:

  • suppose some team adopts the new pip hashing feature (as it is now in this PR) for the sake of repeatability and security in their release pipeline
  • suppose they were already building and publishing wheels (when not a wheel already) to an internal index(es), i.e. they haven't been exercising the wheel cache as part of deployments
  • suppose they need and choose to maintain multiple versions of their requirements: a platform-neutral version with a neutral index and hashes, and others with platform-specific indexes and platform-specific wheel hashes
  • suppose a dev wants to use the neutral version of the requirements in a non-standard environment (i.e. can't use an existing platform-specific index), but wants fast installs, i.e. wants a cache...
  • so, the idea (in a future PR) is that they'd use some new --use-wheel-cache-with-hashes flag, and maintain a private requirements file with their private hashes...

the last point is where I'm not sure. I'm thinking what people will want is just to have a flag to trust their cache (and not have to maintain their own requirements file), and know that later stages of the pipeline won't be so trusting...

but ultimately, that's a debate for a later PR...

Although I'm still mulling this over a bit, I think I'm settled on the wheel cache being turned off in this PR for this feature.

Removed the mention of "package index options" in the docs, because they don't all fit that category anymore. Not even --no-binary and --only-binary do; they're "install options".
@erikrose (Contributor, Author)

It depends on what the "platform-neutral" version is. Is it a pure-Python wheel? Then the wheel cache doesn't even come into play. The wheel is fetched into the http cache, it's installed from there every time thereafter, and everybody's happy.

If it's a sdist, then the wheel cache comes into play.

But in the wake of the Horrible Truth, I'm leaning toward trashing --use-wheel-cache-with-hashes (since you'd forever be chasing changing compiler output) and having that mythical Later PR introduce an option to trust locally built wheels instead. As part of that, we'd want to add the hash of the sdist to the cache key, as you suggested, so that only actual malice, not just any old invalidation screwup, could hurt the install. Or we could maintain a mapping off to the side of sdist hashes to wheels, as dstufft said at one point.

@qwcode (Contributor) commented Oct 21, 2015

It depends on what the "platform-neutral" version is. Is it a pure-Python wheel?

in this story, it was sdist and truly platform-neutral wheels

mythical Later PR introduce an option to trust locally built wheels instead
[...] add the hash of the sdist to the cache key

roger

@qwcode (Contributor) commented Oct 21, 2015

+1 for merging

@erikrose (Contributor, Author)

:-D

@erikrose (Contributor, Author)

@dstufft What do you think? Ready to merge?

@qwcode (Contributor) commented Oct 26, 2015

yea, I'd like to merge before this gets too stale.... cc @pfmoore @dstufft @rbtcollins @xavfernandez @Ivoz

@dstufft (Member) commented Oct 26, 2015

I'll go over it again in a little bit.

@pfmoore (Member) commented Oct 26, 2015

I don't have the time to do a code review (nor is the subject something I'm particularly expert in anyway) but the discussion here seems thorough and I agree with the way things appear to have turned out, so it's OK with me.

@rbtcollins

Will review shortly. This week is openstack summit, so may not be this week - sorry.

@jezdez (Member) commented Nov 4, 2015

Just did another pass over the code and still +1 on merging this. Agreed about using a later PR to add support for locally cached wheel files.

@dstufft (Member) commented Nov 6, 2015

This looks good to me. It has a merge conflict; once that's resolved I can merge (or I'll resolve it in a bit).

@dstufft dstufft mentioned this pull request Nov 6, 2015
@dstufft dstufft closed this in #3231 Nov 7, 2015
dstufft added a commit that referenced this pull request Nov 7, 2015
@erikrose (Contributor, Author) commented Nov 9, 2015

And there was much rejoicing! Thanks for the merge! :-D

@jezdez (Member) commented Nov 9, 2015

woohoo!

@lock lock bot locked as resolved and limited conversation to collaborators Jun 3, 2019