
Various binary cache improvements #34371

haampie opened this issue Dec 7, 2022 · 3 comments


haampie commented Dec 7, 2022

Over the last few weeks there has been quite a bit of discussion about binary caches.

This issue is meant to give an overview of those discussions, the previous and current problems, and a suggested way forward.


The main problems that triggered these discussions:

  1. Requests to s3:// URLs are very slow: roughly 1-3 seconds per request, compared to ~150ms for the equivalent https:// URL of our spack-binaries bucket (see the timing sketch below).
  2. Multiple requests per mirror were issued to locate a spec: spec.yaml, spec.json, spec.json.sig.
  3. There are typically multiple mirrors configured, say 3-5.

Together this adds significant overhead when fetching binaries in CI; @blue42u reported a lot of time wasted (on the order of 30 minutes) just trying to fetch binaries from mirrors.
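
To make the latency gap in point 1 concrete, here is a minimal timing sketch. The bucket name, object key, and https:// endpoint are placeholders, boto3 is assumed to be installed, and this is not how Spack itself issues these requests:

```python
import time
import urllib.request

import boto3  # assumed available; Spack goes through its own URL-handling layer

BUCKET = "spack-binaries"                             # placeholder bucket name
KEY = "develop/build_cache/index.json"                # placeholder object key
HTTPS_URL = "https://<bucket-https-endpoint>/" + KEY  # placeholder https:// equivalent


def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")


s3 = boto3.client("s3")
timed("s3:// head_object", lambda: s3.head_object(Bucket=BUCKET, Key=KEY))
timed("https:// HEAD request",
      lambda: urllib.request.urlopen(urllib.request.Request(HTTPS_URL, method="HEAD")))
```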


We already had a small optimization to reduce the number of requests:

  4. We check the local, offline cache of each mirror for the spec, and then prioritize mirrors with matching specs, which typically results in direct hits.

However, the optimization has a bug: if the spec cannot be located in any local cache (either because none of the remotes have the spec at all, or because we do not have a local cache for a mirror), Spack does a partial update of the cache. Partial in the sense that it queries every mirror for the spec by directly fetching the relevant spec.json files. In this case the optimization does strictly more harm than good: all mirrors are queried before a download even starts, whereas without the "optimization" Spack would simply stop at the first mirror it can download from.
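
For reference, a minimal sketch of what a fully offline version of this mirror ordering could look like (hypothetical helper names and data structures, not Spack's actual code):

```python
# Hypothetical sketch: order mirrors using only locally cached index data.
# `local_indices` maps a mirror URL to the set of dag hashes found in its
# cached index.json; a mirror is missing if we never fetched an index for it.
def order_mirrors(mirrors, local_indices, dag_hash):
    def known_hit(mirror):
        cached = local_indices.get(mirror)
        return cached is not None and dag_hash in cached

    # Stable sort: mirrors with a known hit come first, everything else keeps
    # its configured order. No network requests are made here.
    return sorted(mirrors, key=lambda m: not known_hit(m))
```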


  5. @blue42u changed the optimization in point 4 to go from a partial update to a full update of the cache when there is no local cache hit. The idea is that, for the next spec to install, the cache can finally be exploited and the optimization works: mirrors are ordered to get direct hits.

However, 5. only makes sense for terribly slow mirrors, since there is a high startup cost to fetching an index.json with, say, 100K specs. Slow mirrors are not the norm (most users have file:// mirrors or mirrors on a low-latency local network), so this PR makes the Spack experience worse for those users. For fast mirrors we really want to do direct fetches (and also use the fully offline mirror order optimization).

In fact, we never had any issues with the https://mirror.spack.io URLs for sources; it would be absurd if Spack first downloaded an index of all sources available on mirror.spack.io just to use it when installing packages from source.


What had not really been looked into is why these s3:// requests are so slow in the first place. It turns out the cause is a couple of trivial issues:

  6. Each s3:// 404 caused Spack to also try to download <failing url>/index.html; this was fixed in Stop checking for {s3://path}/index.html (#34325).
  7. Spack creates a new S3 client instance for each request, which itself requires one or more requests to S3 if no credentials are provided, causing huge overhead. For me, reusing the same client instance makes requests 4x faster; see s3: cache client instance (#34372).
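
As an illustration of point 7, a minimal sketch of client reuse using boto3 directly (Spack's actual change in #34372 lives in its own S3 helper code, so this is only the idea):

```python
import functools

import boto3  # assumed available; Spack wraps this in its own s3 utilities


@functools.lru_cache(maxsize=None)
def s3_client(endpoint_url=None, profile_name=None):
    """Build the S3 client once per (endpoint, profile) and reuse it.

    Creating a client can itself trigger credential/region lookups, so doing
    it on every request adds one or more extra round trips.
    """
    session = boto3.session.Session(profile_name=profile_name)
    return session.client("s3", endpoint_url=endpoint_url)
```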

Next, what has not yet been addressed:

  8. Direct existence checks on mirrors are slower than necessary, because in the case of a cache miss three requests are made (spec.yaml, spec.json, spec.json.sig); we can reduce this to a single request.
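
One hedged way to get down to a single request (the file layout, suffixes, and helper below are illustrative, not necessarily what Spack will do): list the keys under the spec's prefix once and pick the best available variant.

```python
# Illustrative sketch: one ListObjectsV2 call instead of up to three GETs.
def find_specfile(s3, bucket, specfile_prefix):
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=specfile_prefix, MaxKeys=10)
    keys = {obj["Key"] for obj in resp.get("Contents", [])}

    # Prefer the signed json, then plain json, then the legacy yaml.
    for suffix in (".spec.json.sig", ".spec.json", ".spec.yaml"):
        candidate = specfile_prefix + suffix
        if candidate in keys:
            return candidate
    return None  # still only one request to learn that it is a miss
```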

When 6-8 are all addressed, I expect the overhead to drop by at least a factor of 10, especially on the unhappy cache-miss path.


Going forward, I think the highest priority is to fix point 7.

Then we should ensure the mirror order optimization is always offline, which means partially reverting @blue42u's PR, and adding index_only=True in the relevant place where a spec is searched for.

To make @blue42u happy, it could be useful to have a command like spack mirror update (or something similar) that updates the local binary index when necessary, which he could then run before spack install in CI.


blue42u commented Dec 7, 2022

> However, 5. only makes sense for terribly slow mirrors, since there is a high startup cost to fetching an index.json with, say, 100K specs. Slow mirrors are not the norm (most users have file:// mirrors or mirrors on a low-latency local network), so this PR makes the Spack experience worse for those users. For fast mirrors we really want to do direct fetches (and also use the fully offline mirror order optimization).

Note that Spack end-users generally have a .spack/cache directory that is not wiped frequently (spack clean is only used when needed). In that case the startup cost is only paid once, and the cost of updating should be low, since only index.json.hash needs to be fetched. This was not strictly true due to a bug that will be fixed by either of the following PRs:

Note also that Spack end-users almost always concretize (more than) once in their normal operations. This already requires the startup cost to be paid to fully implement buildcache --reuse semantics.

For the above users, paying the startup cost and using the mirror order optimization provides an overall benefit, which can be significant with moderate-to-large specs, multiple mirrors, or high-latency "slow" mirrors.
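
For illustration, here is a minimal sketch of the cheap update path mentioned above: fetch the small index.json.hash first and only re-download index.json when the hash has changed. The URLs and cache layout are assumptions, and this is not Spack's actual bindist code.

```python
import os
import urllib.request


def refresh_cached_index(mirror_url, cache_dir):
    """Re-download index.json only if index.json.hash changed (layout assumed)."""
    remote_hash = urllib.request.urlopen(
        mirror_url + "/build_cache/index.json.hash").read().decode().strip()

    hash_path = os.path.join(cache_dir, "index.json.hash")
    if os.path.exists(hash_path):
        with open(hash_path) as f:
            if f.read().strip() == remote_hash:
                return False  # one tiny request, no index download needed

    index_bytes = urllib.request.urlopen(
        mirror_url + "/build_cache/index.json").read()
    with open(os.path.join(cache_dir, "index.json"), "wb") as f:
        f.write(index_bytes)
    with open(hash_path, "w") as f:
        f.write(remote_hash)
    return True  # index was stale (or missing) and has been refreshed
```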

The only "other Spack users" for whom paying this startup cost causes an overall slowdown are users who (a) don't have a persistent cache and (b) don't concretize during their operations. The known (and likely most common) case of this is Spack in CI, where clean-room containers are the norm. Specifically, the build jobs generated by spack ci generate fall into this case, which in turn affects Spack's own CI.

The great irony here is that I implemented #32137 to speed up our own use of Spack in CI. And we reaped benefits, because for us 12 requests * 50 packages was significantly more expensive than fetching the index.json once for each of 3 mirrors. This indicates to me that Spack's CI is different in some way that needs to be investigated.
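
For a rough sense of that trade-off, here is a back-of-the-envelope comparison. The per-request latency and index download time below are assumptions for illustration, not measurements:

```python
# Rough cost model for the numbers quoted above: 50 packages, 12 spec-file
# requests per package (across 3 mirrors), versus one index.json per mirror.
packages = 50
requests_per_package = 12
mirrors = 3

request_latency = 1.5     # seconds per s3:// request (assumed, see point 1)
index_download_time = 30  # seconds per large index.json (assumed)

direct_checks = packages * requests_per_package * request_latency
index_based = mirrors * index_download_time

print(f"direct existence checks: ~{direct_checks:.0f}s")  # ~900s
print(f"index-based lookups:     ~{index_based:.0f}s")     # ~90s
```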

However, since the worst-case scenario is indeed worse now that #32137 has landed, I have given my blessing to #34326 which reverts the behavior it adds that causes problems for Spack's CI.


More of what has not really been addressed:

  9. The index.json format used is very inefficient (see also the size-estimate sketch after this list). Using numbers calculated from the current 172MB index.json for the develop public buildcache (ref):

    • The format can be losslessly reduced by over 20x, just by removing whitespace ("minifying") and compressing with gzip.
    • The optimization referenced here does not require full Spec objects to implement. If only the top-level Spec information is extracted from index.json, the resulting JSON is 4x smaller. Minifying and compressing this "contents.json" results in a file more than 78x smaller than the current format.
  10. Algorithmically, a whole spack ci pipeline before Update the binary index before attempting direct fetches (#32137) took O(#mirrors * #package-nodes^2) requests in total, none of which were for index.json. It currently takes O(#mirrors * #package-nodes + #package-nodes^2) requests, many of which are for index.json.

    One could imagine reducing this to O(#package-nodes^2) by planning ahead of time in spack ci generate which mirror each package will be fetched from. This would be much faster than the pre-#32137 case when request latency dominates over network bandwidth.

  11. spack ci pipelines could have a (semi-)persistent index cache by using the cache: clause provided by GitLab. This could be implemented once the following issues are resolved:
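
To reproduce the kind of size estimate given in point 9, a small sketch that assumes a locally downloaded copy of index.json (the "contents.json" extraction step is only described above, not implemented here):

```python
import gzip
import json
import os

# Measure how much a local copy of index.json shrinks when minified and gzipped.
path = "index.json"  # assumed: a locally downloaded copy of the buildcache index
with open(path) as f:
    index = json.load(f)

minified = json.dumps(index, separators=(",", ":")).encode()
compressed = gzip.compress(minified)

print("original bytes: ", os.path.getsize(path))
print("minified bytes: ", len(minified))
print("minified+gzip:  ", len(compressed))
```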

scottwittenburg commented:

> To make @blue42u happy, it could be useful to have a command like spack mirror update (or something similar) that updates the local binary index when necessary, which he could then run before spack install in CI.

I was thinking something like this too. What (I think) it will also require is some way to get Spack to run bindist.update() in the process that is doing spack install, or else Spack won't ingest its locally cached indices. At the moment, calling bindist.update() results in attempts to fetch index.json.hash for any cached indices, but I guess that shouldn't be a show-stopper.


vsoch commented Jan 7, 2023

To add a note here: it used to be possible to add a filesystem mirror without the file:// prefix. I just discovered that this no longer works and I have to add it explicitly.
