
Fix unnecessary downloads by pipenv lock #3827 #3830

Closed
wants to merge 3 commits

Conversation

@pilkibun

commented Jul 9, 2019

Closes #3827

The issue

pipenv lock retrieves every possible artifact (even those for other platforms/architectures) in order to calculate their hashes, even when PyPI already includes the sha256 hash in the URL. This means lengthy downloads (in one case, 893 MB instead of 50 MB), which users experience as pipenv hanging.

The fix

If the artifact URL already contains a hash value, produced by the same hash function pipenv would have used, we simply extract it and use it directly, without downloading anything. The hash is also stored in the hash cache.
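The extraction step described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name, not pipenv's actual code; PyPI download URLs carry the digest in the URL fragment, e.g. `#sha256=<hex digest>`:

```python
# Sketch of the idea in this PR (hash_from_url is an illustrative name,
# not pipenv's internal API): if a download URL carries a "#sha256=..."
# fragment, reuse that value instead of downloading the artifact.
from urllib.parse import urlparse

def hash_from_url(url, allowed=("sha256",)):
    """Return "algo:value" if the URL fragment embeds a usable hash, else None."""
    fragment = urlparse(url).fragment        # e.g. "sha256=0f1e2d3c"
    if "=" not in fragment:
        return None                          # no hash hint in the URL
    algo, _, value = fragment.partition("=")
    if algo not in allowed or not value:
        return None                          # unsupported or empty digest
    return f"{algo}:{value}"

# Illustrative URL with a made-up digest:
url = "https://files.pythonhosted.org/example-1.0-py3-none-any.whl#sha256=0f1e2d3c"
print(hash_from_url(url))  # -> "sha256:0f1e2d3c"
```

A miss (no fragment, or an unexpected algorithm) would fall back to the existing download-and-hash path.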

The checklist

  • Associated issue
  • A news fragment in the news/ directory

If this is a patch to the vendor directory…

This fixes an existing vendor patch merged in cb895aa.

@frostming

Collaborator

commented Jul 9, 2019

It's a change to an upstream library, but I am OK with this. It just needs another look from @techalchemy @jxltom @uranusjr

@uranusjr

Member

commented Jul 9, 2019

Edit: Oh nvm, this is a Pipenv-only patch.

@uranusjr

Member

commented Jul 9, 2019

I think I am OK with the implementation, but I'm not entirely sure whether it is a good idea to trust the URL hash. On the other hand, if the URL could be compromised, so could the downloaded package, so in principle neither is better than the other; you have to trust one of them eventually.

@techalchemy

Member

commented Jul 9, 2019

I guess I have mixed feelings about this. I've been saying for some time now that 'hash checking' is what takes a long time, but in reality it's the fact that we require downloading multiple platforms' worth of every artifact in order to lock, due to our patches on pip.

I'm a firm no on this specific PR for security reasons. We discussed the different approaches early in the project (the hash cache is itself patched into pip-tools and not native), and for security reasons we deliberately opted to download the contents and build the hashes that way.

From a reproducibility standpoint, however, it gains you nothing to download and hash the artifacts for other platforms which you yourself will not be installing, since you can't use them and aren't building your environment with them; you will never have any idea whether they actually work properly. On the other hand, simply relying on a hash from a URL as a basis of trust is highly insecure: it doesn't even require an SSL handshake, and there is no certificate verification. That was the reason we never agreed to do it that way. That reasoning, fundamentally, hasn't changed, so I'm not sure our approach on that specific point should.
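The download-and-hash approach being defended here can be sketched as follows. This is illustrative only: `hash_artifact` is a made-up name, and in pipenv the chunks would come from a streamed, TLS-verified HTTP response rather than a list of byte strings:

```python
# Minimal sketch of download-time hashing (hypothetical helper, not
# pipenv's real code): the digest attests to bytes actually fetched,
# rather than to a string the index server merely claimed.
import hashlib

def hash_artifact(chunks, algo="sha256"):
    """Hash an artifact's content incrementally, as a streaming download would."""
    h = hashlib.new(algo)
    for chunk in chunks:      # in pipenv this would iterate response chunks
        h.update(chunk)
    return f"{algo}:{h.hexdigest()}"

# Simulate a two-chunk download:
digest = hash_artifact([b"artifact ", b"bytes"])
```

The security argument above is that this digest is computed from content received over a verified connection, whereas a URL fragment is just untrusted metadata.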

I am open to alternative options there, though; clearly, downloading the entire internet's worth of packages is not a great way of handling this...

@frostming

Collaborator

commented Jul 9, 2019

Yeah, I think that was the purpose of the original commit cb895aa.

It only stores the cache entry when the hash part exists in the URL, but never extracts it, which looks like a mistake.

EDIT

I got some insight into the reasoning from @techalchemy's comment. What do you think about taking advantage of some "pypi.org only" methods (the /pypi/{package}/json endpoint) to retrieve the package metadata, falling back to hashing the file if that isn't available?
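A rough sketch of this fallback idea, assuming the public pypi.org JSON API; the helper names here are hypothetical, and only the endpoint path and response shape come from PyPI itself:

```python
# Sketch of fetching per-file sha256 digests from PyPI's JSON metadata
# (https://pypi.org/pypi/{package}/{version}/json) instead of
# downloading every artifact. Helper names are illustrative.
import json
from urllib.request import urlopen

def hashes_from_metadata(meta):
    """Map filename -> "sha256:<digest>" from a PyPI JSON payload."""
    return {
        entry["filename"]: "sha256:" + entry["digests"]["sha256"]
        for entry in meta.get("urls", [])
    }

def fetch_hashes(package, version):
    """Fetch release metadata over HTTPS and extract the file digests."""
    url = f"https://pypi.org/pypi/{package}/{version}/json"
    with urlopen(url) as resp:      # certificate verification on by default
        return hashes_from_metadata(json.load(resp))
```

This only works against pypi.org (or an index exposing the same JSON API), hence the suggested fallback to hashing the file for other indexes.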

@techalchemy

Member

commented Jul 9, 2019

Hmmm, using the JSON endpoint is an interesting idea; it's probably equally secure compared with downloading from there, right? The content is still encrypted, and we will still be doing the SSL handshake, yes?

/cc @nateprewitt I know you had concerns about this 2 years ago when we implemented it

@uranusjr

Member

commented Jul 9, 2019

Using the JSON endpoint is fundamentally the same as using the URL-provided hash. They are both strings from the index server. I’m firmly -1 on the JSON endpoint. Just use the Simple page; if that is a bad idea, so is JSON.

@pilkibun

Author

commented Jul 10, 2019

On consideration, this whole discussion is a moot point.

pipenv must download/build artifacts and resolve dependencies once in order to install them. That means that by the time lock is run, it should already have all the information it needs to reproduce the build.

pip-tools seems to follow this "fetch once" model. Why doesn't pipenv?

@uranusjr

Member

commented Jul 11, 2019

It’s not entirely moot, IMO. Pipenv (ideally) only needs to download one artifact for resolution, so the URL-provided hash can still avoid some downloads.

@techalchemy

Member

commented Jul 11, 2019

pipenv must download/build artifacts and resolve dependencies once in order to install them. That means by the time lock is run, it should already have all the information it needs to reproduce the build.

We obviously would not want to download artifacts more than once. I've been saying everywhere this issue comes up that if someone can identify whether this is actually occurring, and how we can fix it, we will absolutely merge that change.

@techalchemy

Member

commented Jul 11, 2019

The same argument can be made for retrieving the artifact itself, as pointed out. If you trust the server, you can trust its hash hint. The PR should definitely require a trusted connection, but I don't see how following a url provided by the server is considered secure, while the hash information in the very same url is considered insecure. If you feel differently, can you explain the threat model or point to the early discussions you mentioned?

The point about hash hints is again not a point about trust specifically; it's a point about reproducibility. If you are downloading the artifacts, we want to ensure that we know exactly which artifacts represent the valid set of options at a given point in time. We can only actually make a guarantee about the one you installed, which is why it might make sense to avoid the other downloads entirely.

@techalchemy

Member

commented Jul 14, 2019

So you're concerned about the case in which the artifact store (due to a bug, not an attack), reports the wrong hash hint for the artifact?

No, I'm concerned about using hash hints at all, which is a fundamental shift in our implementation.

The only reason I am considering it is that pipenv currently downloads and hashes artifacts for every platform and Python version, regardless of whether it is actively performing an installation with those artifacts. However, it also caches the outcome in a local hash cache. If we switched to using the hash hints, would we use them in place of those supplied in the existing Pipfile?

The difference here is that we are only using a single artifact, so actually performing the hash only tells us something useful about that individual artifact (i.e., from a reproducibility standpoint, we can only be sure that the one specific artifact works the same, since we didn't install any of the others).
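The local hash cache mentioned above can be illustrated with a toy version; this is a hypothetical class, not the hash cache pipenv actually patches into pip-tools:

```python
# Toy illustration of a hash cache keyed by artifact URL (hypothetical
# class): the expensive download-and-hash step runs at most once per URL,
# and later lock runs reuse the stored digest.
class HashCache:
    def __init__(self):
        self._hashes = {}

    def get_hash(self, url, compute):
        """Return the cached digest for url, computing and storing it on a miss."""
        if url not in self._hashes:
            self._hashes[url] = compute(url)
        return self._hashes[url]

cache = HashCache()
calls = []

def expensive_hash(url):         # stands in for download-and-hash
    calls.append(url)
    return "sha256:deadbeef"

cache.get_hash("https://example.test/pkg.whl", expensive_hash)
cache.get_hash("https://example.test/pkg.whl", expensive_hash)
# expensive_hash ran only once despite two lookups
```

The PR's change, in these terms, would seed such a cache from the URL fragment so that `compute` never runs for artifacts whose URLs already carry a digest.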

To address a point @uranusjr made the other day:

Using the JSON endpoint is fundamentally the same as using the URL-provided hash. They are both strings from the index server. I’m firmly -1 on the JSON endpoint. Just use the Simple page; if that is a bad idea, so is JSON.

It's not fundamentally the same, since the proposal is to stop downloading all the artifacts in the first place. We currently only have the option to use the URL hints because we expose ourselves to each InstallRequirement during resolution via various hacks to pip's resolution process. With this proposal we'd basically remove all of that logic and just grab the hashes listed at the JSON endpoint directly.

Using the Simple index would require us to use the XML-RPC client, which is not really a great idea.

Overall this is a pretty major change to our hashing logic. I appreciate your desire to move this forward, and I do think it's possibly a good idea, but I also want to convey that this is a serious modification to both our security model (which I didn't design and probably don't understand well enough to sign off on) and our core resolution logic. We can't be haphazard about this, so it will take some time to consider how to move forward.

@pilkibun

Author

commented Jul 15, 2019

No, I'm concerned about using hash hints at all, which is a fundamental shift in our implementation.

The current implementation has unreasonable lock times for many users. It needs changing if this is a problem you want to fix.

since the proposal is to stop downloading all the artifacts in the first place.

No, the proposal (which this PR now only partially implements) was to hash, and cache the hash of, every artifact that is downloaded at download time, and to use hash hints for whatever hasn't been installed (or tested) at lock time.

Overall this is a pretty major change to our hashing logic. I appreciate your desire to move this forward and I do think it's possibly a good idea but I also want to convey that this is a serious modification to both our security model (which I didn't design and probably don't understand well enough to sign off on) and to our core resolution logic. We can't be haphazard about this, so it will take some time to consider how to move forward.

You've convinced me that I won't be able to fix this with the level of effort I'm willing to put in. I've only been using pipenv for a few days, and at this point the simplest choice for me is to move back to pip.

Thanks.

@pilkibun pilkibun closed this Jul 15, 2019

@techalchemy

Member

commented Jul 16, 2019

The current implementation has unreasonable lock times for many users. It needs changing if this is a problem you want to fix.

We agree; the only thing anyone is uncertain about is the approach.

No, the proposal (which this PR now only partially implements) was to hash, and cache the hash of, every artifact that is downloaded at download time, and to use hash hints for what hasn't been installed (or tested) at lock time.

Right, but that restates the same point. We currently do download every artifact at lock time; anything that modifies this behavior (i.e., using hash hints) is therefore inherently proposing to change that fact.

@ekhaydarov


commented Aug 5, 2019

I'm with @pilkibun on this. pipenv is dead; it's been half an hour trying to install from the lock file. I agree that scientific packages may slow down installs, but think of it in terms of time spent: me spending a couple of hours resolving dependency issues once every 3-6 months, or 30+ minutes wasted on every CI/test/local build? It's a no-brainer to move back to pip.

@andyneff


commented Aug 6, 2019

install from lock file

@ekhaydarov consider not locking when you are just installing from the lock file. I don't like CI/testing updating the packages on me, especially when those updates aren't going to be saved/tracked, so I find pipenv sync to be a better fit there.
