option to confirm expected package hashes when installing #1175

Closed
qwcode opened this Issue Aug 29, 2013 · 33 comments

Comments

Projects
None yet
8 participants
Contributor

qwcode commented Aug 29, 2013

pip should offer some way to confirm expected hashes when installing
(mainly due to it being possible for the same pypi distribution to have a different hash over time; project authors are allowed to delete a distribution, and then upload a new one with the same name and version, but different hash.)

to be clear, pip already validates the hash url fragment in the PyPI link. this is not about doing that. this about confirming that you're getting the same hash that you've installed before, and want to keep getting.

peep offers this feature using requirement file comments: https://pypi.python.org/pypi/peep

as for implementing, here's one idea from @dstufft

"I keep playing with the idea of a lock file. e.g. in your requirements.txt you'd just specify your top level dependencies, then when you install it would resolve the dependencies, put them + hashes in requirements.txt.lock and next time you install it will look for a lockfile and install from there (implying --no-deps) so we don't get this thing where in order to do repeatable builds you basically have to manually pin every dependencies inside of a requirements file by hand"

cc @erikrose

Owner

dstufft commented Aug 29, 2013

One problem to deal with is that you may (rightly) get two different files for the same project + version, for instance if there's a binary Wheel uploaded for Windows you might get that on Windows (or not if you don't use --use-wheel) and you might get something completely different for Linux.

I think peep just ignores this problem and says that a peep flavored requirements file is only promised to be valid on using the same install time options and the same target system.

Contributor

erikrose commented Sep 6, 2013

So, we have a couple of options:

  1. A hole in the current requirements.txt grammar we can exploit is the "extras": pip install Paste[openid]. We could say that any "extra" with an equal sign in it is a key/value thing rather than an actual extra:

    Paste[sha256-win32=abcdefghjijklmnop, sha256-mac=zxcvbnmasdfghjkl]==1.2.3
    

    This would take care of peep hashes and any other future key/value needs. It wouldn't require distributors of requirements.txt files to change their grammars, and we wouldn't have to maintain two parallel parsers. It's a little ugly, but it's extensible, converting the hinky requirements.txt grammar into a tagged one without breaking compatibility with old pips.

  2. The lockfile idea. I like this a lot better. It's elegant, decoupled, and backward-compatible. It's fail-safe iff we require an option to be passed in order to pay attention to the lockfile (actually, let's call it a hashfile so as not to overload a term). We can automatically lay down the hashes nondestructively without having to tiptoe around existing comments and formatting in a requirements.txt.
    I imagine most people wouldn't check requirement-hashes.txt files into source control; they would mostly be maintained by deployers. But in case they wanted to, support for multiple platforms would be trivial: have a separate hashfile for each platform: requirement-hashes-win32.txt, requirement-hashes-win-amd64.txt, etc. Or we could multiplex them into one file if we felt strongly enough and felt like writing more parsing code—probably not worth it, IMO. Dealing with the wheel/non-wheel duality will take a little more thinking, but I don't think it's a killer.

I'm getting pretty close to starting to merge peep into pip. I'll probably take option number 2 unless somebody stops me. Any further thoughts? I'm happy to put forward a more concrete strawman proposal so we don't waste time upfront bikeshedding.

Contributor

erikrose commented Sep 6, 2013

One other horrible truth: currently, requirements.txt supports any pip version specifier, including things like >=. If people use those—a questionable idea anyway—it'll lead to a lot of spurious hash mismatches. I suppose the best thing to do is to intelligently mention that in the error message, but I wonder: does anyone else have philosophical thoughts about the use of non-== specifiers?

erikrose referenced this issue in erikrose/peep Sep 6, 2013

Closed

Merge peep functionality into pip #7

Contributor

pnasrat commented Sep 6, 2013

I don't think we can break the large number of users who may be using >=. It is definitely in use

https://github.com/search?l=&p=2&q=%3E%3D++path%3Arequirements.txt&ref=advsearch&type=Code

It's probably more widely used in private repos too.

To be fair you are breaking http://pythonhosted.org/setuptools/pkg_resources.html#requirement-objects

Which EBNF is

requirement  ::= project_name versionspec? extras?
versionspec  ::= comparison version (',' comparison version)*
comparison   ::= '<' | '<=' | '!=' | '==' | '>=' | '>'
extras       ::= '[' extralist? ']'
extralist    ::= identifier (',' identifier)*
project_name ::= identifier
identifier   ::= [-A-Za-z0-9_]+
version      ::= [-A-Za-z0-9_.]+

So the extralist item should be an identifier and not contain = as you use above.

Contributor

erikrose commented Sep 6, 2013

Yes, obviously we would be changing the grammar. But we'd break old versions of pip only when used against versions of requirements.txt that actually include hashes. That's actually a desired behavior, as it fails safe.

But, like I said, I favor option 2 anyway.

Contributor

erikrose commented Sep 6, 2013

In any case, I certainly didn't mean to suggest dropping support for range specifiers. I was more curious about anyone had worked out a semantic boundary between using them in requirements files and using them in install_requires. I have a pretty good mental model of what == is for in requirements files—known-working sets or frozen deployment sets—but I don't have a good use case in mind when it comes to greater-than and less-than.

Owner

dstufft commented Sep 9, 2013

I would prefer a lockfile (vs a hashfile) because Ideally it would do more than just hashes. I'm thinking the "normal' requirements.txt would no longer need to list every dependency and if the lockfile doesn't exist then pip will install from the requirements.txt, resolve the deps, write out a lockfile with a complete list of deps + hashes. If the lockfile does exist then it can just skip resolving deps and act like --no-deps was passed to it.

Owner

dstufft commented Sep 9, 2013

But that may be more than you want to get into. It's something i've sketched out in my head a few times and just hadn't had time to work on yet.

Contributor

qwcode commented Sep 9, 2013

One other horrible truth: currently, requirements.txt supports any pip version specifier,
including things like >=. If people use those—a questionable idea anyway—it'll lead to a lot of spurious hash
mismatches

You're focused on the repeatability use case for requirements files.

Currently, requirements files are just a way to get in a lot of pip install args. That's how the pip code treats them.

There are 3 main purposes I think:

  1. To achieve repeatability using == specifiers.
  2. As a way to override what will happen just based on install_requires, e.g. to resolve conflicts (that pip's "first found wins" resolver logic doesn't properly do) Many specifier patterns besides == can make sense for this.
  3. Saves typing and easier to reuse.
Contributor

qwcode commented Sep 9, 2013

I'm thinking the "normal' requirements.txt would no longer need to list every dependency

hmm, the phrase "no longer need" here concerns me.

whether it's a requirements file, or a "lock file", when repeatability is concerned, many people will want an overt user artifact that is managed in version control. Not sure of your intention, but the lock file can't just be a run-time by-product, that isn't specifically referenced.

and if the lockfile doesn't exist then pip will install from the requirements.txt, resolve the deps,

the lockfile would need to be something that can be used explictly, not just used if it exists.
otherwise, you can't ensure a repeatable process.

write out a lockfile with a complete list of deps + hashes.

btw, we can't assume write access next to the requirements file.
lock file creation would have to be an explicit command or option to some other command.

I'm concerned about growing a new "lock file" concept in addition to requirements files. We'll end up with three notions: install_requires, requirements files, and lock files. I'd thinking maybe just "requirements 2.0". Just a better requirements file format that supports hashes. Conceptually the same, just better.

Owner

dstufft commented Sep 9, 2013

I think even if we have requirements 2.0 (which is also something i'd love to do) I think we want a lockfile. Bundler has this and it's really nice.

In order to have a repeatable process right now you have to manually manage the entire dependency graph inside of the requirements file. Pinning every single dependency and managing updating them or removing them when no longer needed. A lock file means that you only need to declare what your project depends on and pip itself manages pinning the entire dependency graph.

In Bundler land you have Gemfile instead of requirements.txt and the lockfile is called Gemfile.lock, if Gemfile.lock exists it uses that otherwise it resolves the dependencies and creates Gemfile.lock and then installs it. If you want to force the use of a .lock you run bundle install --deployment instead of just bundle install.

Contributor

qwcode commented Sep 9, 2013

manually manage the entire dependency graph inside of the requirements file.
Pinning every single dependency and managing updating them or removing them when no longer needed.

manual? we have pip freeze to help with that.

In Bundler land you have Gemfile instead of requirements.txt and the lockfile is called Gemfile.lock

hmm, I would say gemfile is like our install_requires', and lock files are like pip freeze-generated requirements files. so, we have "lock files" now. that just don't handle hashes. I'm more concerned with filling in that deficit, not a major conceptual makeover, which is what this would be.

Owner

dstufft commented Sep 9, 2013

manual? we have pip freeze to help with that.

pip freeze helps but there's still a bunch of manual busy work to doing it. It's annoying to keep updated.

hmm, I would say gemfile is like our install_requires', and lock files are like pip freeze-generated requirements files.

No, install_requires would be gemspec. install_requires, requirements.txt, and the proposed lockfile do not include the same information (nor should they).

I'm more concerned with filling in that deficit, not a major conceptual makeover, which is what this would be.

I'm -1 on including another file for just the hashes. Either we should go all out and include a real lock file or we should figure out how to get the hashes into requirements.txt

Owner

dstufft commented Sep 9, 2013

To be specific.

  • install_requires - Holds "abstract" dependencies of a singular installable item. We want these to be as wide open as possible to prevent dependency hell.
  • requirements.txt - User facing file, users don't want to manually manage every single dependency and figure out who still depends on it (if anyone etc). Often times these can/should be ranges. (>=1.2,<1.3)
  • lockfile - Not user facing, has the exact version (==) pinned always. Already has the entire dependency graph solved so --no-deps can be assumed. Lots faster and easy to include things like hashes.

Now it's true you can approximate a lockfile with requirements.txt but that's not a very user friendly UX. At best case they'll want two files anyways and they'll need to delete a virtualenv, install everything, then pip freeze to their manually managed lockfile. At worst case they are going to be manually resolving the dependency graph inside of a single requirements.txt which is going to end up resolved wrong because humans are error prone.

Contributor

qwcode commented Sep 9, 2013

No, install_requires would be gemspec. install_requires, requirements.txt, and the proposed lockfile do not
include the same information (nor should they).

I don't want to linger on gem anymore, because I don't claim to know it, but I think gemfile just exists for the case where an "app" is not gemified or whatever. If the "app" were a gem, the gemfile and gemspec would likely be redundant.

install_requires - Holds "abstract" dependencies of a singular installable item

Understood.

Now it's true you can approximate a lockfile with requirements.txt

Again, "Locking" has been a defining use case for requirements files for years. Stripping "locking" away from requirements files is a big deal.

At best case they'll want two files anyways and they'll need to delete a virtualenv,
install everything, then pip freeze to their manually managed lockfile.

I'll post a more detailed breakdown of the 2 workflows. 1) the current approach using requirements files as lock files, and 2) the new lockfile approach.

Whether "apps" use a "setup.py" or not is an important point in analyzing this. I'll include that in my breakdown.

Contributor

qwcode commented Sep 9, 2013

Whether "apps" use a "setup.py" or not is an important point in analyzing this.

let me emphasize that. If you're default is to always create setup.py (and use install_requires) regardless of whether the app is going to be published, then the role of requirements.txt is simply for overrides and repeatability, if you need that.

otoh, if you're creating apps without a setup.py, then you'll end up wanting two requirements files (as @dstufft mentioned above). One to hold the "abstract" requirements, and the other for deployment repeatability.

Owner

dstufft commented Sep 9, 2013

I disagree about the setup.py bit. It may be that in some circumstances you can treat setup.py as your "human format" and requirements.txt as your "lockfile" but there are also circumstances when you can't. Such as if you have an app that optionally depends on something. Your setup.py would have it in extras and your requirements.txt would have the dependency as .[extraname] or so.

AFAICT there is exactly one case where you can truly treat setup.py as the input to to a requirements.txt as a lock file. Generally I have not seen many people in the wild using setup.py that way, but I have seen some. But importantly for those people the "human format" requirements.txt can be very simple and the lockfile will still be managed automatically for them so this would still make things easier and more streamlined for them.

One of the problems with pip freeze is it freezes the state of the current environment. Which means I have to make sure there's nothing else in that enviornment installed (such as ipython or a stale dependency) and if I want to pip freeze updated requirements I need to delete my virtualenv, recreate it, and reinstall).

Owner

dstufft commented Sep 9, 2013

Besides the fact that pip freeze loses information (for example if you install a tarball).

Contributor

qwcode commented Sep 9, 2013

it may be that in some circumstances you can treat setup.py as your "human format" and requirements.txt
as your "lockfile" but there are also circumstances when you can't

isn't extras the only case? ultimately, your app installation gets off the ground with a pip command. that could include the extras for your top-level app. Running the installation this way would be something that's done at specific times, when you're interested in dealing with upgrading dependency versions.

"human format"? it's the minimally-defined requirements that produce different results as pypi changes.

Contributor

qwcode commented Sep 9, 2013

isn't extras the only case?

well, there's VCS and url formats, if that's what you mean?

Owner

dstufft commented Sep 9, 2013

@qwcode If I have a Django project than "gunicorn" isn't a dependency of my application so it's not appropriate to put it in install_requires but I would want to put it in requirements.txt because for my deployment of my app that's what I'll be using.

Owner

dstufft commented Sep 9, 2013

Basically I believe it should be (in that it makes the most sense and streamlines things for the most use cases).

  • setup.py/install_requires/whl -> A single installable item (possibly with dependencies which are other single installable items)
  • requirements.txt -> a collection of "top" level things that I want to install
  • requirements.txt.lock -> a collection of everything I want to install.

The only case where you don't need 3 distinct formats is the case where your requirements.txt (as above) is a collection with a length of 1. As soon as your collection is more than one thing then you need an intermediate format to serve as the "collection of things to install".

As has also been touched on, setup.py also cannot properly specify every dependency type (because it contains abstract dependencies not concrete ones, see https://caremad.io/blog/setup-vs-requirement/)

Additionally you cannot specify where to install something from in a setup.py however you can in a requirement.txt (which is also an important feature to have).

Contributor

qwcode commented Sep 9, 2013

ok, this breakdown works for me. the abstract/concrete distinction is nice. good stuff to get into the packaging user guide.

About the synchronization between the files... Assuming it will be a practice to check in the lock files in some cases, I can imagine requirements getting "ahead" of the lock file in VCS (i.e. you update your requirements, but never install again before committing), but then the lock file get's used later as part of a deployment process in a separate environment (that only works against lock files). maybe the lock file needs to contain a hash of the source requirement file?

erikrose referenced this issue in erikrose/peep Mar 28, 2014

Closed

Append peep features directly in pip ? #30

  • setup.py/install_requires/whl -> A single installable item (possibly with dependencies which are other single installable items)
  • requirements.txt -> a collection of "top" level things that I want to install
  • requirements.txt.lock -> a collection of everything I want to install.

👍

Contributor

edmorley commented Sep 21, 2015

Having recently gotten used to Bundler's Gemfile.lock and npm shrinkwrap's npm-shrinkwrap.json, returning to pip and having to manually keep the list of inherited dependencies up to date in our requirements file is pretty painful.

About the synchronization between the files... Assuming it will be a practice to check in the lock files in some cases, I can imagine requirements getting "ahead" of the lock file in VCS (i.e. you update your requirements, but never install again before committing), but then the lock file get's used later as part of a deployment process in a separate environment (that only works against lock files). maybe the lock file needs to contain a hash of the source requirement file?

npm shrinkwrap (though it does have it's own set of bugs/issues) warns when out of sync - and things like npm install foo --save automatically update both package.json and npm-shrinkwrap.json (if the latter exists). Something similar would presumably avoid the problem you mention? :-)

Contributor

erikrose commented Sep 22, 2015

Btw, expect a PR from me in a day or two for basic hash checking.

Contributor

qwcode commented Sep 22, 2015

will it conflict with #3125 ?

Contributor

erikrose commented Sep 22, 2015

I don't think so.
On Sep 21, 2015 9:04 PM, Marcus Smith notifications@github.com wrote:will it conflict with #3125 ?

—Reply to this email directly or view it on GitHub.

Contributor

qwcode commented Sep 22, 2015

ok, no biggie if it did. I would just wait before merging to see what you have.
I'm curious to see it now : )

@erikrose erikrose added a commit to erikrose/pip that referenced this issue Sep 23, 2015

@erikrose erikrose Add checks against requirements-file-dwelling hashes for most kinds o…
…f packages. Close #1175.

* Add --require-hashes option. This is handy in deployment scripts to force application authors to hash their requirements. It is also a convenient way to get pip to show computed hashes for a virgin, unhashed requirements file. Eventually, additions to `pip freeze` should fill a superset of this use case.
  * In --require-hashes mode, at least one hash is required to match for each requirement.
  * Option-based requirements (--sha256=...) turn on --require-hashes mode implicitly.
  * Internet-derived URL-based hashes are "necessary but not sufficient": they do not satisfy --require-hashes mode when they match, but they are still used to guard against transmission errors.
  * Other URL-based requirements (#md5=...) are treated just like flag-based ones, except they don't turn on --require-hashes.
* Complain informatively, with the most devastating errors first so you don't chase your tail all day only to run up against a brick wall at the end. This also means we don't complain that a hash is missing, only for the user to find, after fixing it, that we have no idea how to even compute a hash for that type of requirement.
  * Complain about unpinned requirements when hash-checking mode is on, lest they cause the user surprise later.
  * Complain about missing hashes.
  * Complain about requirement types we don't know how to hash (like VCS ones and local dirs).
* Have InstallRequirement keep its original Link around (original_link) so we can differentiate between URL hashes from requirements files and ones downloaded from the (untrustworthy) internet.
* Remove test_download_hashes, which is obsolete. Similar coverage is provided in test_utils.TestHashes and the various hash cases in test_req.py.
241a2ec

@erikrose erikrose added a commit to erikrose/pip that referenced this issue Sep 24, 2015

@erikrose erikrose Add checks against requirements-file-dwelling hashes for most kinds o…
…f packages. Close #1175.

* Add --require-hashes option. This is handy in deployment scripts to force application authors to hash their requirements. It is also a convenient way to get pip to show computed hashes for a virgin, unhashed requirements file. Eventually, additions to `pip freeze` should fill a superset of this use case.
  * In --require-hashes mode, at least one hash is required to match for each requirement.
  * Option-based requirements (--sha256=...) turn on --require-hashes mode implicitly.
  * Internet-derived URL-based hashes are "necessary but not sufficient": they do not satisfy --require-hashes mode when they match, but they are still used to guard against transmission errors.
  * Other URL-based requirements (#md5=...) are treated just like flag-based ones, except they don't turn on --require-hashes.
* Complain informatively, with the most devastating errors first so you don't chase your tail all day only to run up against a brick wall at the end. This also means we don't complain that a hash is missing, only for the user to find, after fixing it, that we have no idea how to even compute a hash for that type of requirement.
  * Complain about unpinned requirements when hash-checking mode is on, lest they cause the user surprise later.
  * Complain about missing hashes.
  * Complain about requirement types we don't know how to hash (like VCS ones and local dirs).
* Have InstallRequirement keep its original Link around (original_link) so we can differentiate between URL hashes from requirements files and ones downloaded from the (untrustworthy) internet.
* Remove test_download_hashes, which is obsolete. Similar coverage is provided in test_utils.TestHashes and the various hash cases in test_req.py.
9cb2497
Contributor

xavfernandez commented Oct 7, 2015

Btw this seems strongly related to #468

dstufft closed this in 1e41f01 Nov 7, 2015

ewdurbin referenced this issue in pypa/pypi-legacy Jun 18, 2016

Closed

Order of archives on index page is not stable #64

Owner

pfmoore commented Sep 1, 2016

It looks like commit 1e41f01 introduced a test that fails on Windows (sorry, I can't track back to the PR that contained the commit). @dstufft @erikrose do either of you have any insight? The test is

    def test_unsupported_hashes(self, data):
        """VCS and dir links should raise errors when --require-hashes is
        on.

        In addition, complaints about the type of requirement (VCS or dir)
        should trump the presence or absence of a hash.

        """
        reqset = self.basic_reqset(require_hashes=True)
        reqset.add_requirement(
            list(process_line(
                'git+git://github.com/pypa/pip-test-package --hash=sha256:123',
                'file',
                1))[0])
        dir_path = data.packages.join('FSPkg')
        reqset.add_requirement(
            list(process_line(
                'file://%s' % (dir_path,),
                'file',
                2))[0])
        finder = PackageFinder([data.find_links], [], session=PipSession())
        sep = os.path.sep
        if sep == '\\':
            sep = '\\\\'  # This needs to be escaped for the regex
        assert_raises_regexp(
            HashErrors,
            r"Can't verify hashes for these requirements because we don't "
            r"have a way to hash version control repositories:\n"
            r"    git\+git://github\.com/pypa/pip-test-package \(from -r file "
            r"\(line 1\)\)\n"
            r"Can't verify hashes for these file:// requirements because they "
            r"point to directories:\n"
            r"    file://.*{sep}data{sep}packages{sep}FSPkg "
            "\(from -r file \(line 2\)\)".format(sep=sep),
            reqset.prepare_files,
            finder)

and it fails with an error in the url2pathname function in the stdlib:

    def url2pathname(url):                                                                                                                  
        """OS-specific conversion from a relative URL of the 'file' scheme                                                                  
        to a file system path; not recommended for general use."""                                                                          
        # e.g.                                                                                                                              
        #   ///C|/foo/bar/spam.foo                                                                                                          
        # and                                                                                                                               
        #   ///C:/foo/bar/spam.foo                                                                                                          
        # become                                                                                                                            
        #   C:\foo\bar\spam.foo                                                                                                             
        import string, urllib.parse                                                                                                         
        # Windows itself uses ":" even in URLs.                                                                                             
        url = url.replace(':', '|')                                                                                                         
        if not '|' in url:                                                                                                                  
            # No drive specifier, just convert slashes                                                                                      
            if url[:4] == '////':                                                                                                           
                # path is something like ////host/path/on/remote/host                                                                       
                # convert this to \\host\path\on\remote\host                                                                                
                # (notice halving of slashes at the start of the path)                                                                      
                url = url[2:]                                                                                                               
            components = url.split('/')                                                                                                     
            # make sure not to convert quoted slashes :-)                                                                                   
            return urllib.parse.unquote('\\'.join(components))                                                                              
        comp = url.split('|')                                                                                                               
        if len(comp) != 2 or comp[0][-1] not in string.ascii_letters:                                                                       
            error = 'Bad URL: ' + url                                                                                                       
>           raise OSError(error)                                                                                                            
E           OSError: Bad URL: \\C|\Users\Gustav\AppData\Local\Temp\pytest-1\popen-gw3\test_unsupported_hashes0\data\packages\FSPkg///C|/work/projects/pip

It looks like a problem with the arcane rules for how Windows drive letters get encoded in URLs. But I don't understand the logic well enough to fix it. For now, I've marked this test as an expected fail on Windows, but someone should take a look at it.

Contributor

erikrose commented Sep 1, 2016

@pfmoore It sounds worth opening a separate issue about. Do you have a traceback handy? I might be able to take a guess if I could see one.

Owner

pfmoore commented Sep 2, 2016

@erikrose Sorry, yes - I should have done that in the first place. #3951 raised.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment