Cache Downloads by default #1732
Comments
Does this include automatically having a local wheelhouse and using it, or is that really a separate issue? |
Separate issue; this is basically "change a configuration option's default". The wheelhouse thing is a little trickier, because to do it really nicely pip would just automatically create a wheel and always cache it. You can get an "ok" solution by changing the defaults for |
Sounds reasonable. |
#1572 Internal Wheel cache |
So the pip download cache is kind of broken for certain dependencies. For instance if you have a dependency like So this might just be a bug in the download cache; GitHub does the right thing in this scenario, and that URL above redirects to a version-specific tarball URL. So maybe the fix is just that pip should not cache redirects (at least not 302s), but instead cache the final download URL (although this then means that the cache for a requirement like the above URL would grow without bound). If so, I can just open a separate issue (maybe with a PR) for don't-cache-redirects.

But I bring it up here because that issue was problematic enough for me that it caused me to turn off the download cache on all my machines, and until it's fixed I would personally not want to see the download cache turned on by default. (Also, turning off the download cache didn't really make things noticeably slower back when I did it, because it turned out that Python distributions are mostly small enough that the finding and installing dwarfed the actual download time. But this may be changing nowadays with PEP 438 and wheels; the download cache may be a bigger win now than it used to be.) |
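To make the failure mode above concrete, here is a minimal simulation, not pip's actual code. Every name in it (`BRANCH_URL`, `CONTENT`, `resolve`, both `fetch_*` helpers) is hypothetical, and the `example.invalid` URLs stand in for a branch URL that 302-redirects to a version-specific tarball.

```python
# Hypothetical simulation: none of these names exist in pip.
BRANCH_URL = "https://example.invalid/archive/master.tar.gz"
CONTENT = {
    "https://example.invalid/archive/abc123.tar.gz": b"version 1",
    "https://example.invalid/archive/def456.tar.gz": b"version 2",
}


def resolve(url, redirects):
    """Follow simulated 302 redirects until no further hop is defined."""
    while url in redirects:
        url = redirects[url]
    return url


def fetch_keyed_by_requested_url(url, redirects, cache):
    """Cache under the URL that was asked for; stale once the redirect moves."""
    if url not in cache:
        cache[url] = CONTENT[resolve(url, redirects)]
    return cache[url]


def fetch_keyed_by_final_url(url, redirects, cache):
    """Re-resolve the redirect every time, then cache under the final URL."""
    final = resolve(url, redirects)
    if final not in cache:
        cache[final] = CONTENT[final]
    return cache[final]


def demo():
    redirects = {BRANCH_URL: "https://example.invalid/archive/abc123.tar.gz"}
    naive, by_final = {}, {}
    fetch_keyed_by_requested_url(BRANCH_URL, redirects, naive)
    fetch_keyed_by_final_url(BRANCH_URL, redirects, by_final)
    # The branch moves: the same URL now redirects to a newer tarball.
    redirects[BRANCH_URL] = "https://example.invalid/archive/def456.tar.gz"
    stale = fetch_keyed_by_requested_url(BRANCH_URL, redirects, naive)
    fresh = fetch_keyed_by_final_url(BRANCH_URL, redirects, by_final)
    return stale, fresh, len(by_final)
```

Keying by the final URL returns the new tarball, but the cache now holds one entry per version, which is the unbounded growth mentioned above.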
So, something I've thought about before, is having pip rip out all of our custom cache code and just use https://github.com/ionrock/cachecontrol. Pros:
Cons:
So essentially we'll trade the "we'll always cache things even if we shouldn't" approach (and the fact that it "works around" servers that don't send cache headers) for a more correct approach that will also cache non-package things and will still attempt a conditional request if the cached object has fallen out of the cache. Personally, I believe removing the download cache and switching to CacheControl for a standards-respecting HTTP cache is a good thing, and I think we should do it. |
The main concern I have is that we're vendoring a lot of stuff these days. Is this really what we want to do? There's a risk involved in that the need to wait for an upstream fix makes us less responsive when fixing bugs (we've seen this with distlib). |
I think so, yes. The likelihood that the CacheControl library does the "right" thing is a lot higher than for our custom code. We could try reimplementing CacheControl internally, but I think that's ultimately a bad direction to go down. Worst case scenario, if we end up needing a fix from upstream and they don't do it, we can fork and rename, or move it from our |
As an example of doing the "right" thing, pip currently completely screws up if more than one thing accesses the download cache directory, whereas CacheControl does the right thing in that circumstance. |
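One common way to keep a file cache consistent when several processes write to it is to write to a temporary file and rename it into place. The sketch below illustrates that general technique only; it is not a description of how CacheControl or pip handle this, and `atomic_cache_write` is a hypothetical helper.

```python
import os
import tempfile


def atomic_cache_write(cache_dir, name, data):
    """Write data to cache_dir/name so readers never observe a partial file."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, name)
    # The temp file lives in the same directory so the rename stays on one
    # filesystem, which is what allows it to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, target)  # overwrites any existing entry
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target
```

With this shape, two processes caching the same file at once each produce a complete file, and whichever rename happens last wins.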
Cool. Sounds fair, then. |
So I've got a proof of concept working (and it seems to work well). However, there is one wrinkle that I'd like to discuss. Basically CacheControl works by hooking into requests and wrapping the place where requests does the actual "do an HTTP request". When it's constructing a request it will read from its cache and, if needed, will either skip the request (and return the cached response) or modify it to do a conditional request instead. Then when it's constructing the response to a request it will either refresh the cache (based on a 304 response from the conditional request) or replace it with a new response. The side effect of refreshing/setting the cache is that CacheControl has to consume the entire response (so that it has the content in order to actually cache it), which means that the entire file is downloaded prior to pip getting control of the However what I'm wondering here is: do we need the download percentages at all anymore? Ideally with this code we will not be downloading the files themselves very often anymore, and instead will be using cached copies of the files (either without making any HTTP requests, or by making a conditional HTTP request), so in a lot of circumstances the download percentage will scroll by nearly instantly. In the cases where it doesn't scroll past nearly instantly, hopefully the bulk of people will still download files fairly quickly due to Fastly on PyPI. Since this question is about removing a fairly user-visible feature, I want to get the opinion of @pypa/pip-developers. |
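The request/response flow described above can be sketched roughly as follows. This is a simplified illustration of an HTTP cache's decisions, not CacheControl's implementation: `CachedResponse`, `plan_request`, and `handle_response` are hypothetical names, and freshness is reduced to a single `max_age` value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedResponse:
    body: bytes
    etag: Optional[str]
    expires_at: float  # time after which the entry must be revalidated


def plan_request(entry, now):
    """Decide what to do before sending: skip, revalidate, or fetch in full."""
    if entry is None:
        return "full", {}
    if now < entry.expires_at:
        return "use_cache", {}  # still fresh: no HTTP request at all
    if entry.etag is not None:
        return "conditional", {"If-None-Match": entry.etag}
    return "full", {}


def handle_response(entry, status, body, etag, max_age, now):
    """Refresh the entry on 304, otherwise replace it with the new response."""
    if status == 304 and entry is not None:
        return CachedResponse(entry.body, entry.etag, now + max_age)
    # Storing a new entry needs the complete body, which is why the whole
    # download has to finish before the caller sees the response.
    return CachedResponse(body, etag, now + max_age)
```

For example, an entry stored at time 0 with `max_age=60` is served from the cache at time 30 and revalidated with `If-None-Match` at time 61.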
Oh, also killing progress bars will probably enable us to switch to stdlib logging, or at the very least refactor our logging to be a lot cleaner since we no longer will need to deal with the difference between tty or not. |
Why are progress bars involved with logging anyway? |
Because |
I could handle losing the progress meter. It's rarely holding on that. |
I don't mind losing a progress bar for downloads (I don't recall I've ever seen it) but I do like (for instance) the dots that are printed as an install runs. Essentially, I'm not too bothered by the specifics of what we display, but we should give some indication of overall progress (and time to download typically isn't the biggest contributor here). |
Dots?

$ pip install Django
Downloading/unpacking Django  # This part here will say "20% X.YMB downloaded"
  Downloading Django-1.6.2-py2.py3-none-any.whl (6.7MB): 6.7MB downloaded
Installing collected packages: Django
Successfully installed Django
Cleaning up...

I'm not sure what dots you mean? Also it's possible the download progress bar doesn't work on Windows at all because it requires a tty. |
Doh. I'm thinking of virtualenv. Yeah, quite possibly that's why I've never seen the progress bar. So I guess I'm OK with losing it :-) |
IMHO logging has always been for goddamn logs. Displaying progress bars on stdout is not, and never should have been, a part of "logging" in pip's (or any application's) context. It's just standard output informational messages to the user; a log might eventually report whether the download completed successfully or not. So I think we could "just as easily" (probably not in practice, but hopefully you get the point) switch to standard logging and keep progress bars if we really wanted. Have you thought about using HEAD requests to find only the download URL rather than the whole body? In my head I'm thinking we could use a hash present in the URL (like there currently is for PyPI) to indicate that X is pullable from the cache, because then you can find exactly whether a download has a particular hash and match it to a cached thing. |
There's no reason to reimplement ETags like that. |
If ETags work fine as a means of uniquely identifying the thing we wish to download then absolutely it could be implemented on top of that instead. |
I think I'm actually confused about what you're proposing :) Why would we use anything but the URL as the cache key? |
Because hashes are like, defined as being a uniquely identifying key, that's their primary purpose in life! |
I would prefer keeping the progress bars if it doesn't pose any insurmountable problems. |
So, I've submitted some pull requests to CacheControl fixing a few problems I found while integrating it. There's one more PR left to land, but once it lands our existing code, the progress bar included, works without modification. The only thing that isn't working is that when it's pulling from the cache it can't determine the size of the file for the output. I'm not sure why that is yet, but I assume it should be a trivial fix in either pip or CacheControl. For reference, the remaining PR to CacheControl is: psf/cachecontrol#25 And this is what a diff that adds cache support to pip looks like: https://gist.github.com/dstufft/11213903 The patch hardcodes the OS X path, so it isn't viable for inclusion yet, but once we have an answer for #1733 it should be trivial to use that instead. It also doesn't attempt to remove either the download cache or the page cache that we currently have in place, so this would ideally be a diff that removes quite a bit of code. |
A new diff that includes removing the download cache and the in-memory page cache: https://gist.github.com/dstufft/11237512 |
We should cache downloads by default, ideally in the standard system location for cache stuff. Perhaps a command to clear this cache would be useful too, but this is 2014: hard drives are massive and the pip cache is generally small.
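As a rough sketch of what "the standard system location" could mean, the hypothetical helper below picks a per-user cache directory by platform. The Linux (`XDG_CACHE_HOME`, falling back to `~/.cache`) and OS X (`~/Library/Caches`) locations follow the usual platform conventions; the Windows layout under `LOCALAPPDATA` is an assumption for illustration, and none of this is pip's actual logic.

```python
import os


def default_cache_dir(platform, env, home, app="pip"):
    """Return a per-user cache directory for `app` (illustrative only)."""
    if platform == "darwin":
        return os.path.join(home, "Library", "Caches", app)
    if platform.startswith("win"):
        # Assumed layout; Windows has no single agreed cache convention.
        base = env.get("LOCALAPPDATA") or os.path.join(home, "AppData", "Local")
        return os.path.join(base, app, "Cache")
    # Linux and other Unix-likes: XDG Base Directory convention.
    base = env.get("XDG_CACHE_HOME") or os.path.join(home, ".cache")
    return os.path.join(base, app)
```

A caller would pass real values, for example `default_cache_dir(sys.platform, os.environ, os.path.expanduser("~"))`; taking them as parameters keeps the function easy to test.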