
Locally Cache Resolved Packages #1892

Closed
techalchemy opened this issue Apr 2, 2018 · 9 comments
Labels: Category: Dependency Resolution, Category: Future, Type: Behavior Change, Type: Discussion, Type: Enhancement 💡
Milestone: 11.11.0

Comments

@techalchemy
Member

As part of an effort to speed up the locking process, I think we can make some incremental gains by storing a local cache of the dependency graph of each package we resolve, keyed by ABI tag + Python version.

Essentially I would suggest:

  • Flat cache storing one package per entry, with only one level of nesting
  • Each package is a dict-like containing the name of the package and an iterable of resolved versions
    • Each resolved version stores the following:
      • Hashes
      • ABI tag resolved against (platform/Python version/compile flags)
      • dependencies -> dict-like with keys ['name', 'resolved_version', 'specifiers', 'markers']
  • Some implementation of this which could be stored in the lockfile itself to make it portable and avoid re-resolving known graphs (a rough sketch follows this list)
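A rough sketch of what one cache entry might look like under this scheme (all field names are illustrative assumptions, not a committed schema):

```python
import json

# Sketch of one entry in the proposed flat cache: the top-level key is
# the package name, mapping to an iterable of resolved versions, each
# carrying its hashes, the ABI tag it was resolved against, and its
# direct dependencies. Field names here are illustrative assumptions.
entry = {
    "requests": [
        {
            "resolved_version": "2.18.4",
            "hashes": ["sha256:<artifact hash>"],
            "abi_tag": "cp36-cp36m-manylinux1_x86_64",
            "dependencies": [
                {
                    "name": "idna",
                    "resolved_version": "2.6",
                    "specifiers": ">=2.5,<2.7",
                    "markers": "",
                },
            ],
        },
    ],
}
print(json.dumps(entry, indent=2))
```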

This saves us from continuously hitting PyPI for information that is not going to change, and lets us re-resolve and re-lock without re-executing each setup file over and over, which is incredibly slow. While this may come with some risk of indeterminism in setup.py execution, I think the risk is minimal and the gains are vast.

Reference issues: #1785 #1866

/cc @ncoghlan @uranusjr @jtratner

@techalchemy added the Type: Enhancement 💡, Type: Discussion, and Category: Dependency Resolution labels on Apr 2, 2018
@uranusjr
Member

uranusjr commented Apr 3, 2018

I assume each cache file represents a (package + tags), maybe named with a scheme similar to PEP 427? If that is the case, each cache file would contain direct dependency information of that (package + tags) combination.
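If so, the file names might follow the wheel convention of {distribution}-{version}-{python tag}-{abi tag}-{platform tag}; a minimal sketch, assuming JSON files (the extension and helper are hypothetical):

```python
# Hypothetical PEP 427-style names for per-(package + tags) cache files,
# each holding that combination's direct dependency information, e.g.:
#
#   requests-2.18.4-py2.py3-none-any.json
#   lxml-4.2.1-cp36-cp36m-manylinux1_x86_64.json

def cache_filename(name, version, python_tag, abi_tag, platform_tag):
    """Build a PEP 427-style cache file name (sketch)."""
    return f"{name}-{version}-{python_tag}-{abi_tag}-{platform_tag}.json"
```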

@ncoghlan
Member

ncoghlan commented Apr 3, 2018

+1 from me. I don't think the cache will make any current issues with indeterminism worse, and in some ways it is likely to mitigate them - rerunning a lock on the same machine when there haven't been any new releases will always give the same answer, for example, even if some of the setup.py files involved are non-deterministic.

@techalchemy
Member Author

What are the tradeoffs when distributing the cache across multiple files? Is there anywhere in particular we should look for inspiration on the performance front? Is there an ideal format?

@ncoghlan
Member

ncoghlan commented Apr 3, 2018

The core trade-off I'm aware of is that using multiple files makes it much easier to avoid conflicts when multiple processes are writing to the cache at the same time, but it also uses up more inodes on the filesystem (which is typically only a potential problem on actual file servers, not development workstations).

There are also differences in how the two models interact with filesystem caching in general, as well as what happens when the cache is actually stored on a network drive.

Unfortunately, I've never had much luck looking for actual guides on this, since my search results always get swamped by links related to regular application caching.

However, one use-case-specific consideration here is that it would be nice if the pipenv artifact metadata caches used a similar layout to pip's actual artifact caches, so if you know how to navigate one, you know how to navigate the other. That would push towards a multi-file storage layout.

@jtratner
Collaborator

jtratner commented Apr 6, 2018

Caching more things would definitely be helpful (and I like the idea!) 👍

I had a couple thoughts:

A couple points in favor of multiple files:

  1. It's easier to reason about.
  2. There's a really nice parallel if the naming scheme for wheels matched the naming scheme for the cache, in terms of handling all edge cases.
  3. It's easier to parallelize, which seems to be a key direction for performance gains (see the sketch after this list).

(but you could still pull the cache together into one big file after the fact :) )
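One reason the multi-file layout parallelizes well: each resolver process can write its own entry atomically without coordinating on a shared file. A minimal sketch, assuming JSON entries and an atomic rename:

```python
import json
import os
import tempfile

# Sketch: write one cache entry per file via a temp file plus atomic
# rename, so concurrent resolver processes never see (or produce) a
# half-written entry. File naming is left to the PEP 427-style scheme
# discussed above.
def write_entry(cache_dir: str, filename: str, entry: dict) -> None:
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(entry, f)
    os.replace(tmp_path, os.path.join(cache_dir, filename))
```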

In terms of the lockfile, it'd be neat to be able to bias resolution towards keeping package versions as similar as possible across re-locks.

> This saves us from continuously hitting PyPI for information that is not going to change, and lets us re-resolve and re-lock without re-executing each setup file over and over, which is incredibly slow.

Do we know which things really aren't going to change? For example, you can upload a new build with a different build number (for whatever reason, though I assume it's in the case of a bad compilation or a bad setup.py), so it's not necessarily the case that asking for the same dependency range resolves to the same set of options. Perhaps it'd be better to key on URL-with-hash and still require at least one listing from PyPI to figure out whether package lists have changed when attempting to lock again.
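A sketch of keying on URL-with-hash (the helper is hypothetical; PyPI links artifacts with a #sha256=... fragment, so two uploads differing only in build number never share a key):

```python
import hashlib

# Hypothetical: derive the cache key from the full artifact URL,
# hash fragment included, e.g.
#   ".../requests-2.18.4-py2.py3-none-any.whl#sha256=<digest>"
# A re-uploaded build gets a different URL/hash, hence a fresh entry.
def cache_key(artifact_url: str) -> str:
    return hashlib.sha256(artifact_url.encode("utf-8")).hexdigest()
```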

In the same vein, I'm not sure you can cache all resolved versions of a package, because you'd still need to hit PyPI to find out if new versions are available.

> However, one use-case-specific consideration here is that it would be nice if the pipenv artifact metadata caches used a similar layout to pip's actual artifact caches, so if you know how to navigate one, you know how to navigate the other. That would push towards a multi-file storage layout.

I tried reusing pip's caches in another PR and the internal API is simple, but the actual file layout is somewhat opaque (it looks like it's just the sha256 split into a couple of pieces to reduce the number of files in a specific directory). It might be nice if the final component of the directory tree included some human-readable name too.
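The splitting described above looks roughly like this (a sketch of the general technique, not pip's exact internals; the hash algorithm and split widths may differ between pip versions):

```python
import hashlib
import os

# Split the hex digest of the cache key into nested directory names so
# no single directory accumulates too many files. Appending a
# human-readable name to the final component, as suggested above, would
# make the tree browsable.
def cache_path(cache_root: str, key: str, readable_name: str) -> str:
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    parts = [digest[:2], digest[2:4], digest[4:6], digest[6:]]
    return os.path.join(cache_root, *parts, readable_name + ".json")
```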

@ncoghlan
Member

ncoghlan commented Apr 6, 2018

The key mapping that we want to cache is "Which dependencies does version X of project Y declare?", since that isn't a simple API query for Python packages - it requires downloading artifacts and looking inside them.

However, the answer to that question is going to remain stable once calculated for a published artifact, since PyPI itself prohibits changing previously published artifacts. (Technically other index servers can allow artifact replacement, but I'm OK with viewing "We replaced a non-development artifact in our internal index server and changed its dependency declarations without bumping the version number" as a self-inflicted problem that folks will need to resolve by purging any affected caches).

Caching dependency resolution is much harder (beyond what the lock file does), since that's an inherently time-dependent question: new feature and maintenance releases may be published at any time.
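A sketch of the cacheable mapping described above (extract_dependencies is a placeholder for the expensive download-and-inspect step):

```python
def extract_dependencies(artifact_url: str) -> list:
    """Placeholder for the expensive step: download the artifact and
    read the dependencies it declares (wheel METADATA, or a setup.py
    run for sdists)."""
    raise NotImplementedError

def declared_dependencies(cache: dict, artifact_url: str) -> list:
    # Published artifacts are immutable on PyPI, so once computed the
    # answer can be cached indefinitely, keyed by the artifact itself.
    if artifact_url not in cache:
        cache[artifact_url] = extract_dependencies(artifact_url)
    return cache[artifact_url]
```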

@techalchemy
Member Author

Right, I don't think there is any way to cache all versions of anything, or even to know that we have cached the current version. The only thing I want to cache is that we have resolved <dependency set> for version x.y.z from a non-local source (or possibly specifically from PyPI?). If we can maintain that cache we can, as @ncoghlan mentioned earlier, check whether we would be resolving a different version than the one we have cached and, if not, just use the cached dependency graph.
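A sketch of that re-lock flow (resolve_candidate is a placeholder for the single index query that is still required):

```python
def resolve_candidate(name: str, specifier: str) -> str:
    """Placeholder: query the index once to learn which version the
    specifier resolves to right now."""
    raise NotImplementedError

def lock_package(cache: dict, name: str, specifier: str) -> dict:
    # One cheap index hit tells us which version we'd pick today; if it
    # matches a cached resolution, reuse the cached dependency graph
    # instead of re-executing the package's setup file.
    version = resolve_candidate(name, specifier)
    cached = cache.get((name, version))
    if cached is not None:
        return cached
    graph = {"name": name, "version": version, "dependencies": []}
    cache[(name, version)] = graph  # placeholder; real code would recompute the graph
    return graph
```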

@ncoghlan
Member

ncoghlan commented Apr 7, 2018

We don't want caching proxies like devpi/Artifactory/Nexus/Pulp et al. to make pipenv dramatically slower again, so I think the caching should be active for all artifact sources by default (perhaps with a source-level "cache_metadata: no" option). The fact that the cache key is based on full URLs should keep the cached metadata from colliding across different index servers.

@techalchemy added this to the 11.11.0 milestone on Apr 29, 2018
@techalchemy added the Category: Future and Type: Behavior Change labels on Apr 29, 2018
@matteius
Member

matteius commented Jul 5, 2022

The year is 2022 and pipenv caches packages, so I am not sure if there is a subtle difference between what pipenv does today with caching and what was requested in 2018; if so, this is best addressed with a new issue report. Noting also that if you want to clear the pipenv cache, run pipenv --clear.

@matteius closed this as completed on Jul 5, 2022