
Cache Downloads by default #1732

Closed
dstufft opened this issue Apr 18, 2014 · 26 comments
Labels
auto-locked Outdated issues that have been locked by automation

Comments

@dstufft
Member

dstufft commented Apr 18, 2014

We should cache downloads by default, ideally in the standard system location for cache stuff. Perhaps a command to clear this cache would be useful too, but this is 2014; hard drives are massive and the pip cache is generally small.
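
(For illustration, "standard system location" would mean something like the following per platform; a rough sketch with assumed paths, not pip's actual logic:)

# Illustrative sketch only: one way to pick a platform-appropriate cache
# directory (XDG-style on Linux, the usual conventions on OS X and Windows).
# Not pip's actual implementation; the paths are assumptions.
import os
import sys

def default_cache_dir(appname="pip"):
    if sys.platform == "darwin":
        return os.path.expanduser(os.path.join("~/Library/Caches", appname))
    if sys.platform == "win32":
        base = os.environ.get("LOCALAPPDATA", os.path.expanduser("~"))
        return os.path.join(base, appname, "Cache")
    # Linux and friends: XDG_CACHE_HOME, falling back to ~/.cache
    base = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    return os.path.join(base, appname)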

@dstufft dstufft mentioned this issue Apr 18, 2014
@alex
Member

alex commented Apr 18, 2014

Does this include automatically having a local wheelhouse and using it, or is that really a separate issue?

@dstufft
Member Author

dstufft commented Apr 18, 2014

Separate issue; this one is basically "change a configuration option's default". The wheelhouse thing is a little trickier, because to really do it nicely pip would just automatically create a wheel and cache it, always. You can get an "OK" solution by changing the defaults for pip wheel and --find-links, but that requires people to explicitly run pip wheel to cache something, which is less ideal than just doing it for them.

@alex
Member

alex commented Apr 18, 2014

Sounds reasonable.

@qwcode
Contributor

qwcode commented Apr 18, 2014

Does this include automatically having a local wheelhouse and using it
or is that really a separate issue?

#1572 Internal Wheel cache

@dstufft dstufft added this to the Improve our User Experience milestone Apr 19, 2014
@carljm
Contributor

carljm commented Apr 21, 2014

So the pip download cache is kind of broken for certain dependencies. For instance, if you have a dependency like http://github.com/my/project/tarball/master (e.g. in a tox.ini), you almost certainly want the latest master tarball every time those dependencies are installed; you don't want pip to cache the master-branch tarball from one point in time and then never see any later development.

So this might just be a bug in the download cache; GitHub does the right thing in this scenario, and the URL above redirects to a version-specific tarball URL. So maybe the fix is just that pip should not cache redirects (at least not 302s) but instead cache the final download URL (although that means the cache for a requirement like the one above would grow without bound). If so, I can open a separate issue (maybe with a PR) for don't-cache-redirects.

But I bring it up here because that issue was problematic enough for me that it caused me to turn off the download cache on all my machines, and until it's fixed I would personally not want to see the download cache turned on by default.

(Also, turning off the download cache didn't really make things noticeably slower back when I did it, because it turned out that Python distributions are mostly small enough that the finding and installing dwarfed the actual download time. But this may be changing nowadays with PEP 438 and wheels; it may be that the download cache is a bigger win now than it used to be.)
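
(For illustration, the "cache the final URL, not the redirect" idea could look roughly like this with requests; the helper name is made up and this is not a proposed patch:)

# Hypothetical sketch: cache by the resolved URL rather than the requested
# one, so a moving target like .../tarball/master never pins a stale snapshot.
import requests

def fetch_and_cache_key(url):
    resp = requests.get(url, allow_redirects=True)
    resp.raise_for_status()
    # If we were redirected, resp.url is the final, version-specific URL;
    # use that as the cache key instead of the original "master" URL.
    cache_key = resp.url if resp.history else url
    return cache_key, resp.content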

@dstufft
Member Author

dstufft commented Apr 21, 2014

So, something I've thought about before is having pip rip out all of our custom cache code and just use https://github.com/ionrock/cachecontrol.

Pros:

  • Can (probably) just turn the cache on completely and remove the configuration options for it.
  • Respects server headers, so it'll be more browser-like and less surprising, with fewer edge cases like the GitHub one.
  • Will cache more than just packages: also the simple index and find-links pages.
  • Removes some of pip's code and relies on someone else's code to handle it for us.
  • Understands ETag and Last-Modified headers, so it will make conditional requests if something falls out of the cache, keeping things more streamlined.
  • Solves concurrency-related problems with our current caching solution.

Cons:

  • The cache will not be as efficient for packages, especially on a server that doesn't set Cache-Control, Expires, ETag, or Last-Modified headers.
    • I think that, given a lot of servers will be behind nginx/apache2, at least Last-Modified will be sent by default, so at the very least that should move things from uncached to conditional requests, though still not as efficient as our current setup.
  • Two additional dependencies, CacheControl and lockfile, which will need to be bundled.
  • Will probably need to write some wrapper code to handle invalidating our cache if the hash of the cached object doesn't match what we expect.
    • This should still be less code than our current caching solution.

So essentially we'll trade the "we'll always cache things even if we shouldn't" approach (and the fact that it "works around" servers that don't send cache headers) for a more correct approach that will also cache non-package things and will still attempt a conditional request if the cached object has fallen out of the cache.

For myself, I believe removing the download cache and switching to CacheControl for a standards-respecting HTTP cache is a good thing, and I think we should do it.
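
(For reference, the basic CacheControl wiring is small; a minimal sketch, with a placeholder cache directory rather than whatever pip would actually use:)

# Minimal sketch of wrapping a requests session with CacheControl backed by
# an on-disk cache. The cache path is a placeholder, not pip's choice.
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

session = CacheControl(requests.session(),
                       cache=FileCache("/tmp/pip-http-cache"))

# Responses now follow the normal HTTP caching rules (Cache-Control,
# Expires, ETag, Last-Modified), including conditional GETs when a cached
# entry needs revalidating.
response = session.get("https://pypi.python.org/simple/")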

@pfmoore
Member

pfmoore commented Apr 21, 2014

The main concern I have is that we're vendoring a lot of stuff these days. Is this really what we want to do? There's a risk that the need to wait for an upstream fix makes us less responsive when fixing bugs (we've seen this with distlib).

@dstufft
Member Author

dstufft commented Apr 21, 2014

I think so, yes. The likelihood that the CacheControl library does the "right" thing is a lot higher than with our custom code. We could try reimplementing CacheControl internally, but I think that's ultimately a bad direction to go down. Worst-case scenario, if we end up needing a fix from upstream and they don't do it, we can fork and rename, or move it from our _vendor directory into our normal namespace and declare it an integrated part of pip.

@dstufft
Member Author

dstufft commented Apr 21, 2014

As an example of doing the "right" thing: pip currently completely screws up if more than one thing accesses the download cache directory at the same time, whereas CacheControl does the right thing in that circumstance.
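
(The usual way to avoid that kind of corruption is to write each cache entry to a temporary file and atomically rename it into place; a generic sketch of the pattern, not pip's or CacheControl's actual code:)

# Generic write-then-rename pattern so concurrent readers never observe a
# partially written cache entry. Illustrative only.
import os
import tempfile

def atomic_write(path, data):
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.rename(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise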

@pfmoore
Member

pfmoore commented Apr 21, 2014

Cool. Sounds fair, then.

@dstufft
Member Author

dstufft commented Apr 21, 2014

So I've got a proof of concept working (and it seems to work well).

However, there is one wrinkle I'd like to discuss. Basically, CacheControl works by hooking into requests and wrapping the place where requests actually performs an HTTP request. When it's constructing a request it will read from its cache and, if needed, either skip the request (and return the cached response) or modify it into a conditional request instead. Then, when it's constructing the response to a request, it will either refresh the cache (on a 304 response to the conditional request) or replace the cached entry with the new response.

The side effect of refreshing/setting the cache is that CacheControl has to consume the entire response (so that it has the content to actually cache), which means the entire file is downloaded before pip gets control of the Response object. This essentially breaks our "display a download percentage" feature. I know how I can fix this so that it still works (it's essentially going to involve adding a callback to the Response class, so that consuming the content is what triggers the download percentage, rather than us consuming the content and handling the percentage manually).
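
(Loosely, the callback idea is to let whoever actually consumes the body drive the progress display; a sketch with made-up names, not the eventual fix:)

# Sketch only: progress is reported from wherever the content is consumed,
# rather than pip reading the body itself. report_progress is hypothetical.
def iter_with_progress(resp, report_progress, chunk_size=4096):
    total = int(resp.headers.get("Content-Length", 0)) or None
    seen = 0
    for chunk in resp.iter_content(chunk_size):
        seen += len(chunk)
        report_progress(seen, total)  # e.g. render "20% 1.3MB downloaded"
        yield chunk

def save_response(resp, fp, report_progress):
    for chunk in iter_with_progress(resp, report_progress):
        fp.write(chunk)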

However, what I'm wondering here is: do we need the download percentages at all anymore? Ideally with this code we will not be downloading the files themselves very often, and will instead be using cached copies (either without making any HTTP request, or by making a conditional one), so in a lot of circumstances the download percentage will scroll by nearly instantly. In the cases where it doesn't, hopefully the bulk of people will still download files fairly quickly thanks to Fastly on PyPI.

Since this question is about removing a fairly user visible feature, I want to get the opinion of @pypa/pip-developers.

@dstufft
Member Author

dstufft commented Apr 21, 2014

Oh, also, killing progress bars will probably enable us to switch to stdlib logging, or at the very least refactor our logging to be a lot cleaner, since we will no longer need to deal with the difference between a tty and not.

@agronholm
Contributor

Why are progress bars involved with logging anyway?

@dstufft
Member Author

dstufft commented Apr 21, 2014

Because pip.log isn't exactly logging; it's our user interface module that just happens to almost be logging, except for progress bars and colors.

@qwcode
Contributor

qwcode commented Apr 21, 2014

I could handle losing the progress meter; it's rarely the thing we're waiting on.

@pfmoore
Member

pfmoore commented Apr 21, 2014

I don't mind losing a progress bar for downloads (I don't recall ever having seen it), but I do like (for instance) the dots that are printed as an install runs. Essentially, I'm not too bothered by the specifics of what we display, but we should give some indication of overall progress (and time to download typically isn't the biggest contributor here).

@dstufft
Member Author

dstufft commented Apr 21, 2014

Dots?

$ pip install Django
Downloading/unpacking Django                             # This part here will say "20% X.YMB downloaded"
  Downloading Django-1.6.2-py2.py3-none-any.whl (6.7MB): 6.7MB downloaded
Installing collected packages: Django
Successfully installed Django
Cleaning up...

I'm not sure what dots you mean?

Also it's possible the download progress bar doesn't work on Windows at all because it requires a tty.

@pfmoore
Member

pfmoore commented Apr 21, 2014

Doh. I'm thinking of virtualenv. Yeah, quite possibly that's why I've never seen the progress bar. So I guess I'm OK with losing it :-)

@Ivoz
Contributor

Ivoz commented Apr 21, 2014

IMHO logging has always been for goddamn logs. Displaying progress bars on stdout is not, and never should have been, a part of "logging" in pip's (or any application's) context. It's just informational output to the user on stdout; a log might eventually report whether the download completed successfully or not.

So I think we could "just as easily" (probably not in practice, but hopefully you get the point) switch to standard logging and keep progress bars if we really wanted to.

Have you thought about using HEAD requests to find only the download URL rather than fetching the whole body? In my head I'm thinking we could use a hash present in the URL (like there currently is for PyPI) to indicate that X is pullable from the cache, because then you can tell exactly whether a download has a particular hash and match it to a cached thing.

@dstufft
Member Author

dstufft commented Apr 21, 2014

There's no reason to reimplement ETags like that.
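
(For context, ETag revalidation with requests looks roughly like this; the URL is a placeholder, and the server answers 304 with no body when the cached copy is still current:)

# Rough illustration of ETag-based revalidation, the standard HTTP mechanism
# CacheControl relies on; not pip code, and the URL is a placeholder.
import requests

url = "https://example.com/packages/Django-1.6.2-py2.py3-none-any.whl"
first = requests.get(url)
etag = first.headers.get("ETag")

# Later: ask the server whether our cached body is still current.
check = requests.get(url, headers={"If-None-Match": etag} if etag else {})
if check.status_code == 304:
    body = first.content   # cached copy is still valid, nothing re-downloaded
else:
    body = check.content   # server sent a fresh copy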

@Ivoz
Contributor

Ivoz commented Apr 21, 2014

If ETags work fine as a means of uniquely identifying the thing we wish to download, then absolutely it could be implemented on top of that instead.

@dstufft
Member Author

dstufft commented Apr 21, 2014

I think I'm actually confused about what you're proposing :) Why would we use anything but the URL as the cache key?

@Ivoz
Contributor

Ivoz commented Apr 22, 2014

Because hashes are, like, defined as being a uniquely identifying key; that's their primary purpose in life!

@agronholm
Contributor

I would prefer keeping the progress bars if it doesn't pose any insurmountable problems.
There are some rather large distributions (PySide?) for which I would really like to see how the download is going.

@dstufft
Member Author

dstufft commented Apr 23, 2014

So, I've submitted some pull requests to CacheControl fixing a few problems I found while integrating it. There's one more PR left to land, but once it lands our existing code, progress bar included, works without modification. The only thing that isn't working is that, when pulling from the cache, it's not able to determine the size of the file for the output. I'm not sure why that is yet, but I assume it should be a trivial fix in either pip or CacheControl.

For reference the remaining PR to CacheControl is: psf/cachecontrol#25

And this is what a diff looks like that adds cache support to pip: https://gist.github.com/dstufft/11213903

The patch hardcodes the OS X path, so it isn't viable for inclusion yet, but once we have an answer for #1733 it should be trivial to use that instead. It also doesn't attempt to remove either the download cache or the page cache that we currently have in place, so ideally the final diff would remove quite a bit of code.

@dstufft
Member Author

dstufft commented Apr 24, 2014

A new diff that includes removing the download cache and the in-memory page cache: https://gist.github.com/dstufft/11237512

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 5, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 5, 2019