# option to confirm expected package hashes when installing #1175

qwcode opened this issue Aug 29, 2013 · 33 comments

### qwcode (Contributor) commented Aug 29, 2013

pip should offer some way to confirm expected hashes when installing, mainly because the same PyPI distribution can have a different hash over time: project authors are allowed to delete a distribution and then upload a new one with the same name and version but a different hash.

To be clear, pip already validates the hash URL fragment in the PyPI link; this is not about doing that. This is about confirming that you're getting the same hash that you've installed before, and want to keep getting. peep offers this feature using requirements file comments: https://pypi.python.org/pypi/peep

As for implementing, here's one idea from @dstufft:

> I keep playing with the idea of a lock file. e.g. in your requirements.txt you'd just specify your top level dependencies, then when you install it would resolve the dependencies, put them + hashes in requirements.txt.lock, and next time you install it will look for a lockfile and install from there (implying --no-deps), so we don't get this thing where in order to do repeatable builds you basically have to manually pin every dependency inside of a requirements file by hand
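The check being asked for here is, at its core, simple: hash the downloaded archive and compare against a pinned digest. A minimal sketch of that comparison (a hypothetical helper for illustration, not pip or peep code):

```python
import hashlib

def archive_matches(path, expected_sha256):
    """Return True if the file at `path` hashes to the pinned digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large sdists/wheels don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

The hard part, as the rest of the thread shows, is not the hashing but where the pinned digests live and how they stay in sync with the requirements.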
### dstufft (Owner) commented Aug 29, 2013

One problem to deal with is that you may (rightly) get two different files for the same project + version: for instance, if there's a binary wheel uploaded for Windows you might get that on Windows (or not, if you don't use --use-wheel), and you might get something completely different on Linux. I think peep just ignores this problem and says that a peep-flavored requirements file is only promised to be valid when using the same install-time options on the same target system.
### erikrose (Contributor) commented Sep 6, 2013

So, we have a couple of options:

1. A hole in the current requirements.txt grammar we can exploit is the "extras": `pip install Paste[openid]`. We could say that any "extra" with an equals sign in it is a key/value thing rather than an actual extra: `Paste[sha256-win32=abcdefghjijklmnop, sha256-mac=zxcvbnmasdfghjkl]==1.2.3`. This would take care of peep hashes and any other future key/value needs. It wouldn't require distributors of requirements.txt files to change their grammars, and we wouldn't have to maintain two parallel parsers. It's a little ugly, but it's extensible, converting the hinky requirements.txt grammar into a tagged one without breaking compatibility with old pips.

2. The lockfile idea. I like this a lot better. It's elegant, decoupled, and backward-compatible. It's fail-safe iff we require an option to be passed in order to pay attention to the lockfile (actually, let's call it a hashfile so as not to overload a term). We can automatically lay down the hashes nondestructively without having to tiptoe around existing comments and formatting in a requirements.txt. I imagine most people wouldn't check requirement-hashes.txt files into source control; they would mostly be maintained by deployers. But in case they wanted to, support for multiple platforms would be trivial: have a separate hashfile for each platform (requirement-hashes-win32.txt, requirement-hashes-win-amd64.txt, etc.), or we could multiplex them into one file if we felt strongly enough and felt like writing more parsing code—probably not worth it, IMO. Dealing with the wheel/non-wheel duality will take a little more thinking, but I don't think it's a killer.

I'm getting pretty close to starting to merge peep into pip. I'll probably take option number 2 unless somebody stops me. Any further thoughts? I'm happy to put forward a more concrete strawman proposal so we don't waste time upfront bikeshedding.
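For option 1, the parsing change would amount to partitioning the bracketed list on `=`. A rough sketch of that split (purely illustrative; pip never adopted this syntax, and the key names are invented):

```python
def split_extras(extras):
    """Split a requirement's "extras" list into real extras and
    key/value options.

    Anything containing '=' is treated as a key/value option
    (e.g. 'sha256-win32=abcdef...'); everything else is a normal extra.
    """
    real, options = [], {}
    for item in (e.strip() for e in extras):
        if "=" in item:
            key, _, value = item.partition("=")
            options[key] = value
        else:
            real.append(item)
    return real, options
```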
### erikrose (Contributor) commented Sep 6, 2013

 One other horrible truth: currently, requirements.txt supports any pip version specifier, including things like >=. If people use those—a questionable idea anyway—it'll lead to a lot of spurious hash mismatches. I suppose the best thing to do is to intelligently mention that in the error message, but I wonder: does anyone else have philosophical thoughts about the use of non-== specifiers?


### pnasrat (Contributor) commented Sep 6, 2013

I don't think we can break the large number of users who may be using >=. It is definitely in use: https://github.com/search?l=&p=2&q=%3E%3D++path%3Arequirements.txt&ref=advsearch&type=Code. It's probably more widely used in private repos too. To be fair, you are also breaking http://pythonhosted.org/setuptools/pkg_resources.html#requirement-objects, whose EBNF is:

```
requirement  ::= project_name versionspec? extras?
versionspec  ::= comparison version (',' comparison version)*
comparison   ::= '<' | '<=' | '!=' | '==' | '>=' | '>'
extras       ::= '[' extralist? ']'
extralist    ::= identifier (',' identifier)*
project_name ::= identifier
identifier   ::= [-A-Za-z0-9_]+
version      ::= [-A-Za-z0-9_.]+
```

So an extralist item should be an identifier and should not contain '=' as you use above.
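The `identifier` production above can be checked mechanically; here is a small sketch (my own transcription of the quoted grammar into a regex, not pip's actual parser) showing why a key/value "extra" is rejected:

```python
import re

# The `identifier` production from the quoted EBNF: [-A-Za-z0-9_]+
IDENTIFIER = re.compile(r"^[-A-Za-z0-9_]+$")

def is_valid_extralist(extras):
    """Check each extra against the pkg_resources `identifier` rule."""
    return all(IDENTIFIER.match(e.strip()) for e in extras)
```

A plain extra like `openid` passes, while `sha256-win32=abc...` fails because `=` is not in the identifier character class.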
### erikrose (Contributor) commented Sep 6, 2013

 Yes, obviously we would be changing the grammar. But we'd break old versions of pip only when used against versions of requirements.txt that actually include hashes. That's actually a desired behavior, as it fails safe. But, like I said, I favor option 2 anyway.
### erikrose (Contributor) commented Sep 6, 2013

In any case, I certainly didn't mean to suggest dropping support for range specifiers. I was more curious whether anyone had worked out a semantic boundary between using them in requirements files and using them in install_requires. I have a pretty good mental model of what == is for in requirements files—known-working sets or frozen deployment sets—but I don't have a good use case in mind when it comes to greater-than and less-than.
### dstufft (Owner) commented Sep 9, 2013

I would prefer a lockfile (vs a hashfile) because ideally it would do more than just hashes. I'm thinking the "normal" requirements.txt would no longer need to list every dependency: if the lockfile doesn't exist, pip will install from the requirements.txt, resolve the deps, and write out a lockfile with a complete list of deps + hashes. If the lockfile does exist, then pip can just skip resolving deps and act as if --no-deps was passed to it.
### dstufft (Owner) commented Sep 9, 2013

But that may be more than you want to get into. It's something I've sketched out in my head a few times and just haven't had time to work on yet.
### qwcode (Contributor) commented Sep 9, 2013

> One other horrible truth: currently, requirements.txt supports any pip version specifier, including things like >=. If people use those—a questionable idea anyway—it'll lead to a lot of spurious hash mismatches

You're focused on the repeatability use case for requirements files. Currently, requirements files are just a way to get in a lot of pip install args; that's how the pip code treats them. There are 3 main purposes, I think:

1. To achieve repeatability, using == specifiers.
2. As a way to override what will happen based solely on install_requires, e.g. to resolve conflicts (which pip's "first found wins" resolver logic doesn't properly do). Many specifier patterns besides == can make sense for this.
3. To save typing and make installs easier to reuse.
### qwcode (Contributor) commented Sep 9, 2013

> I'm thinking the "normal" requirements.txt would no longer need to list every dependency

hmm, the phrase "no longer need" here concerns me. Whether it's a requirements file or a "lock file", when repeatability is concerned, many people will want an overt user artifact that is managed in version control. Not sure of your intention, but the lock file can't just be a run-time by-product that isn't specifically referenced.

> and if the lockfile doesn't exist then pip will install from the requirements.txt, resolve the deps,

The lockfile would need to be something that can be used explicitly, not just used if it happens to exist; otherwise, you can't ensure a repeatable process.

> write out a lockfile with a complete list of deps + hashes.

btw, we can't assume write access next to the requirements file. Lock file creation would have to be an explicit command, or an option to some other command.

I'm concerned about growing a new "lock file" concept in addition to requirements files. We'll end up with three notions: install_requires, requirements files, and lock files. I'm thinking maybe just "requirements 2.0": a better requirements file format that supports hashes. Conceptually the same, just better.
### dstufft (Owner) commented Sep 9, 2013

I think even if we have requirements 2.0 (which is also something I'd love to do) we still want a lockfile. Bundler has this and it's really nice. In order to have a repeatable process right now, you have to manually manage the entire dependency graph inside of the requirements file: pinning every single dependency and managing updating them or removing them when no longer needed. A lock file means that you only need to declare what your project depends on, and pip itself manages pinning the entire dependency graph. In Bundler land you have a Gemfile instead of requirements.txt, and the lockfile is called Gemfile.lock; if Gemfile.lock exists it uses that, otherwise it resolves the dependencies, creates Gemfile.lock, and then installs. If you want to force the use of the .lock you run bundle install --deployment instead of just bundle install.
### qwcode (Contributor) commented Sep 9, 2013

> manually manage the entire dependency graph inside of the requirements file. Pinning every single dependency and managing updating them or removing them when no longer needed.

Manual? We have pip freeze to help with that.

> In Bundler land you have Gemfile instead of requirements.txt and the lockfile is called Gemfile.lock

hmm, I would say a Gemfile is like our install_requires, and lock files are like pip freeze-generated requirements files. So we have "lock files" now; they just don't handle hashes. I'm more concerned with filling in that deficit, not a major conceptual makeover, which is what this would be.
### dstufft (Owner) commented Sep 9, 2013

> manual? we have pip freeze to help with that.

pip freeze helps, but there's still a bunch of manual busywork to doing it. It's annoying to keep updated.

> hmm, I would say gemfile is like our install_requires', and lock files are like pip freeze-generated requirements files.

No, install_requires would be the gemspec. install_requires, requirements.txt, and the proposed lockfile do not include the same information (nor should they).

> I'm more concerned with filling in that deficit, not a major conceptual makeover, which is what this would be.

I'm -1 on including another file for just the hashes. Either we should go all out and include a real lock file, or we should figure out how to get the hashes into requirements.txt.
### dstufft (Owner) commented Sep 9, 2013

To be specific:

- install_requires: holds "abstract" dependencies of a singular installable item. We want these to be as wide open as possible to prevent dependency hell.
- requirements.txt: user-facing file; users don't want to manually manage every single dependency and figure out who still depends on it (if anyone, etc.). Oftentimes these can/should be ranges (>=1.2,<1.3).
- lockfile: not user-facing; always has the exact version (==) pinned. Already has the entire dependency graph solved, so --no-deps can be assumed. Lots faster, and easy to include things like hashes.

Now, it's true you can approximate a lockfile with requirements.txt, but that's not a very user-friendly UX. At best they'll want two files anyway, and they'll need to delete a virtualenv, install everything, then pip freeze to their manually managed lockfile. At worst they are going to be manually resolving the dependency graph inside of a single requirements.txt, which is going to end up resolved wrong, because humans are error-prone.
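To make the three roles concrete, here is a hypothetical pair of files; the lockfile syntax (including the `--sha256=` spelling and the pinned versions) is invented for illustration, since no such format existed in pip at the time:

```
# requirements.txt -- human-edited, top level only, ranges allowed
Django>=1.5,<1.6
gunicorn

# requirements.txt.lock -- generated, whole graph, pinned + hashed
Django==1.5.4 --sha256=...
gunicorn==18.0 --sha256=...
pytz==2013.7 --sha256=...
```

Note that pytz appears only in the lockfile: it is pulled in transitively (by Django), which is exactly the bookkeeping the lockfile is meant to take off the user's hands.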
### qwcode (Contributor) commented Sep 9, 2013

> No, install_requires would be gemspec. install_requires, requirements.txt, and the proposed lockfile do not include the same information (nor should they).

I don't want to linger on gem anymore, because I don't claim to know it, but I think a Gemfile just exists for the case where an "app" is not gemified or whatever. If the "app" were a gem, the Gemfile and gemspec would likely be redundant.

> install_requires - Holds "abstract" dependencies of a singular installable item

Understood.

> Now it's true you can approximate a lockfile with requirements.txt

Again, "locking" has been a defining use case for requirements files for years. Stripping "locking" away from requirements files is a big deal.

> At best case they'll want two files anyways and they'll need to delete a virtualenv, install everything, then pip freeze to their manually managed lockfile.

I'll post a more detailed breakdown of the 2 workflows: 1) the current approach using requirements files as lock files, and 2) the new lockfile approach. Whether "apps" use a setup.py or not is an important point in analyzing this; I'll include that in my breakdown.
### qwcode (Contributor) commented Sep 9, 2013

> Whether "apps" use a "setup.py" or not is an important point in analyzing this.

Let me emphasize that. If your default is to always create a setup.py (and use install_requires) regardless of whether the app is going to be published, then the role of requirements.txt is simply for overrides and repeatability, if you need that. otoh, if you're creating apps without a setup.py, then you'll end up wanting two requirements files (as @dstufft mentioned above): one to hold the "abstract" requirements, and the other for deployment repeatability.
### dstufft (Owner) commented Sep 9, 2013

I disagree about the setup.py bit. It may be that in some circumstances you can treat setup.py as your "human format" and requirements.txt as your "lockfile", but there are also circumstances when you can't, such as when you have an app that optionally depends on something. Your setup.py would have it in extras, and your requirements.txt would have the dependency as .[extraname] or so. AFAICT there is exactly one case where you can truly treat setup.py as the input to a requirements.txt used as a lock file. Generally I have not seen many people in the wild using setup.py that way, but I have seen some. Importantly, for those people the "human format" requirements.txt can be very simple, and the lockfile will still be managed automatically for them, so this would still make things easier and more streamlined. One of the problems with pip freeze is that it freezes the state of the current environment, which means I have to make sure there's nothing else installed in that environment (such as ipython or a stale dependency), and if I want to pip freeze updated requirements I need to delete my virtualenv, recreate it, and reinstall.
### dstufft (Owner) commented Sep 9, 2013

 Besides the fact that pip freeze loses information (for example if you install a tarball).
### qwcode (Contributor) commented Sep 9, 2013

> it may be that in some circumstances you can treat setup.py as your "human format" and requirements.txt as your "lockfile" but there are also circumstances when you can't

Isn't extras the only case? Ultimately, your app installation gets off the ground with a pip command; that could include the extras for your top-level app. Running the installation this way would be something that's done at specific times, when you're interested in dealing with upgrading dependency versions. "Human format"? It's the minimally-defined requirements that produce different results as PyPI changes.
### qwcode (Contributor) commented Sep 9, 2013

> isn't extras the only case?

Well, there's VCS and URL formats, if that's what you mean?
### dstufft (Owner) commented Sep 9, 2013

@qwcode If I have a Django project, "gunicorn" isn't a dependency of my application, so it's not appropriate to put it in install_requires; but I would want to put it in requirements.txt, because that's what I'll be using for my deployment of my app.


### dstufft (Owner) commented Sep 9, 2013

Basically I believe it should be (in that it makes the most sense and streamlines things for the most use cases):

- setup.py/install_requires/whl -> a single installable item (possibly with dependencies which are other single installable items)
- requirements.txt -> a collection of "top"-level things that I want to install
- requirements.txt.lock -> a collection of everything I want to install

The only case where you don't need 3 distinct formats is the case where your requirements.txt (as above) is a collection with a length of 1. As soon as your collection is more than one thing, you need an intermediate format to serve as the "collection of things to install". As has also been touched on, setup.py cannot properly specify every dependency type (because it contains abstract dependencies, not concrete ones; see https://caremad.io/blog/setup-vs-requirement/). Additionally, you cannot specify where to install something from in a setup.py, but you can in a requirements.txt (which is also an important feature to have).
### qwcode (Contributor) commented Sep 9, 2013

ok, this breakdown works for me. The abstract/concrete distinction is nice; good stuff to get into the packaging user guide. About the synchronization between the files: assuming it will be a practice to check in the lock files in some cases, I can imagine requirements getting "ahead" of the lock file in VCS (i.e. you update your requirements, but never install again before committing), but then the lock file gets used later as part of a deployment process in a separate environment (that only works against lock files). Maybe the lock file needs to contain a hash of the source requirements file?
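The staleness link suggested here could be as simple as a digest header in the lockfile. A sketch under that assumption (the file names, helper names, and header format are all made up for illustration):

```python
import hashlib

def requirements_digest(path):
    """Hash the requirements file the lockfile was generated from."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def lockfile_is_stale(requirements_path, lockfile_path):
    """True if the lockfile's recorded digest no longer matches the
    requirements file, i.e. requirements changed without re-locking."""
    with open(lockfile_path) as f:
        first_line = f.readline().strip()
    # Assumed header form: "# requirements-sha256: <hex digest>"
    recorded = first_line.rpartition(":")[2].strip()
    return recorded != requirements_digest(requirements_path)
```

An installer could then refuse to deploy from a stale lockfile, which addresses exactly the "requirements got ahead of the lock file in VCS" scenario.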


### westurner commented Mar 29, 2014

> setup.py/install_requires/whl -> A single installable item (possibly with dependencies which are other single installable items)
> requirements.txt -> a collection of "top" level things that I want to install
> requirements.txt.lock -> a collection of everything I want to install.

👍


### edmorley (Contributor) commented Sep 21, 2015

Having recently gotten used to Bundler's Gemfile.lock and npm shrinkwrap's npm-shrinkwrap.json, returning to pip and having to manually keep the list of inherited dependencies up to date in our requirements file is pretty painful.

> About the synchronization between the files... Assuming it will be a practice to check in the lock files in some cases, I can imagine requirements getting "ahead" of the lock file in VCS (i.e. you update your requirements, but never install again before committing), but then the lock file gets used later as part of a deployment process in a separate environment (that only works against lock files). maybe the lock file needs to contain a hash of the source requirement file?

npm shrinkwrap (though it does have its own set of bugs/issues) warns when out of sync, and things like npm install foo --save automatically update both package.json and npm-shrinkwrap.json (if the latter exists). Something similar would presumably avoid the problem you mention? :-)
### erikrose (Contributor) commented Sep 22, 2015

 Btw, expect a PR from me in a day or two for basic hash checking.
### qwcode (Contributor) commented Sep 22, 2015

 will it conflict with #3125 ?
### erikrose (Contributor) commented Sep 22, 2015

> will it conflict with #3125 ?

I don't think so.
### qwcode (Contributor) commented Sep 22, 2015

 ok, no biggie if it did. I would just wait before merging to see what you have. I'm curious to see it now : )

### erikrose added a commit to erikrose/pip that referenced this issue Sep 23, 2015

erikrose: Add checks against requirements-file-dwelling hashes for most kinds of packages. Close #1175.

* Add --require-hashes option. This is handy in deployment scripts to force application authors to hash their requirements. It is also a convenient way to get pip to show computed hashes for a virgin, unhashed requirements file. Eventually, additions to pip freeze should fill a superset of this use case.
* In --require-hashes mode, at least one hash is required to match for each requirement.
* Option-based requirements (--sha256=...) turn on --require-hashes mode implicitly.
* Internet-derived URL-based hashes are "necessary but not sufficient": they do not satisfy --require-hashes mode when they match, but they are still used to guard against transmission errors.
* Other URL-based requirements (#md5=...) are treated just like flag-based ones, except they don't turn on --require-hashes.
* Complain informatively, with the most devastating errors first so you don't chase your tail all day only to run up against a brick wall at the end. This also means we don't complain that a hash is missing, only for the user to find, after fixing it, that we have no idea how to even compute a hash for that type of requirement.
* Complain about unpinned requirements when hash-checking mode is on, lest they cause the user surprise later.
* Complain about missing hashes.
* Complain about requirement types we don't know how to hash (like VCS ones and local dirs).
* Have InstallRequirement keep its original Link around (original_link) so we can differentiate between URL hashes from requirements files and ones downloaded from the (untrustworthy) internet.
* Remove test_download_hashes, which is obsolete. Similar coverage is provided in test_utils.TestHashes and the various hash cases in test_req.py.

Commit: 241a2ec

### erikrose added a commit to erikrose/pip that referenced this issue Sep 24, 2015

erikrose: Add checks against requirements-file-dwelling hashes for most kinds of packages. Close #1175. (Same commit message as 241a2ec above.) Commit: 9cb2497
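For the record, the feature as it eventually shipped in pip 8 (hash-checking mode) spells the per-requirement option `--hash` rather than the `--sha256=...` form mentioned in the commit message; usage looks roughly like this (the digest below is an invented placeholder):

```
# requirements.txt
FooProject==1.2 --hash=sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

$ pip install --require-hashes -r requirements.txt
```

Note the requirement must be pinned with `==`, per the "complain about unpinned requirements" behavior described above.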


### xavfernandez (Contributor) commented Oct 7, 2015

 Btw this seems strongly related to #468


### pfmoore (Owner) commented Sep 1, 2016

It looks like commit 1e41f01 introduced a test that fails on Windows (sorry, I can't track back to the PR that contained the commit). @dstufft @erikrose, do either of you have any insight? The test is:

```python
def test_unsupported_hashes(self, data):
    """VCS and dir links should raise errors when --require-hashes is on.

    In addition, complaints about the type of requirement (VCS or dir)
    should trump the presence or absence of a hash.
    """
    reqset = self.basic_reqset(require_hashes=True)
    reqset.add_requirement(
        list(process_line(
            'git+git://github.com/pypa/pip-test-package --hash=sha256:123',
            'file',
            1))[0])
    dir_path = data.packages.join('FSPkg')
    reqset.add_requirement(
        list(process_line(
            'file://%s' % (dir_path,),
            'file',
            2))[0])
    finder = PackageFinder([data.find_links], [], session=PipSession())
    sep = os.path.sep
    if sep == '\\':
        sep = '\\\\'  # This needs to be escaped for the regex
    assert_raises_regexp(
        HashErrors,
        r"Can't verify hashes for these requirements because we don't "
        r"have a way to hash version control repositories:\n"
        r" git\+git://github\.com/pypa/pip-test-package \(from -r file "
        r"\(line 1\)\)\n"
        r"Can't verify hashes for these file:// requirements because they "
        r"point to directories:\n"
        r" file://.*{sep}data{sep}packages{sep}FSPkg "
        "\(from -r file \(line 2\)\)".format(sep=sep),
        reqset.prepare_files,
        finder)
```

and it fails with an error in the url2pathname function in the stdlib:

```python
def url2pathname(url):
    """OS-specific conversion from a relative URL of the 'file' scheme
    to a file system path; not recommended for general use."""
    # e.g.
    #   ///C|/foo/bar/spam.foo
    # and
    #   ///C:/foo/bar/spam.foo
    # become
    #   C:\foo\bar\spam.foo
    import string, urllib.parse
    # Windows itself uses ":" even in URLs.
    url = url.replace(':', '|')
    if not '|' in url:
        # No drive specifier, just convert slashes
        if url[:4] == '////':
            # path is something like ////host/path/on/remote/host
            # convert this to \\host\path\on\remote\host
            # (notice halving of slashes at the start of the path)
            url = url[2:]
        components = url.split('/')
        # make sure not to convert quoted slashes :-)
        return urllib.parse.unquote('\\'.join(components))
    comp = url.split('|')
    if len(comp) != 2 or comp[0][-1] not in string.ascii_letters:
        error = 'Bad URL: ' + url
        raise OSError(error)
```

The failure is:

```
OSError: Bad URL: \\C|\Users\Gustav\AppData\Local\Temp\pytest-1\popen-gw3\test_unsupported_hashes0\data\packages\FSPkg///C|/work/projects/pip
```

It looks like a problem with the arcane rules for how Windows drive letters get encoded in URLs, but I don't understand the logic well enough to fix it. For now, I've marked this test as an expected fail on Windows, but someone should take a look at it.
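To see why that URL is rejected, here is a self-contained repro of the quoted stdlib logic (the tail of the function after the `raise OSError` is simplified from the Python 3 `nturl2path` source, so it runs on any OS without importing the Windows-specific module):

```python
import string
import urllib.parse

def url2pathname(url):
    """Simplified copy of the stdlib nturl2path.url2pathname quoted above."""
    # Windows itself uses ":" even in URLs.
    url = url.replace(':', '|')
    if '|' not in url:
        # No drive specifier, just convert slashes.
        if url[:4] == '////':
            url = url[2:]
        return urllib.parse.unquote('\\'.join(url.split('/')))
    comp = url.split('|')
    if len(comp) != 2 or comp[0][-1] not in string.ascii_letters:
        raise OSError('Bad URL: ' + url)
    # Rebuild a drive-letter path from the part after the '|'.
    path = comp[0][-1].upper() + ':'
    for part in comp[1].split('/'):
        if part:
            path += '\\' + urllib.parse.unquote(part)
    return path

# A well-formed drive-letter URL converts cleanly:
print(url2pathname('///C:/foo/bar/spam.foo'))  # C:\foo\bar\spam.foo

# But a URL containing two drive specifiers, like the failing test's
# (a drive-letter URL appended to an already-absolute path), splits
# into three parts on '|' and hits the "Bad URL" branch:
try:
    url2pathname('///C:/work///C:/other')
except OSError as exc:
    print(exc)  # Bad URL: ///C|/work///C|/other
```

So the bug is upstream of url2pathname: the test machinery somehow concatenates a second drive-letter URL onto an existing Windows path before conversion.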
### erikrose (Contributor) commented Sep 1, 2016

 @pfmoore It sounds worth opening a separate issue about. Do you have a traceback handy? I might be able to take a guess if I could see one.
### pfmoore (Owner) commented Sep 2, 2016

 @erikrose Sorry, yes - I should have done that in the first place. #3951 raised.