
Various binary cache improvements #34371

haampie opened this issue Dec 7, 2022 · 3 comments


haampie commented Dec 7, 2022

Over the last few weeks there has been quite a bit of discussion about binary caches.

This issue is meant to give an overview of those discussions, the previous and current problems, and a suggested way forward.


The main problems that triggered these discussions:

  1. Requests to s3:// URLs are very slow: roughly 1-3 seconds per request, compared to ~150ms for the equivalent https:// URL of our spack-binaries bucket (see the timing sketch below).
  2. Multiple requests per mirror were issued to locate a spec: spec.yaml, spec.json, spec.json.sig.
  3. There are typically multiple mirrors configured, say 3-5.

Together this adds significant overhead when fetching binaries in CI; @blue42u reported a lot of time wasted (on the order of 30 minutes) just trying to fetch binaries from mirrors.
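
To make the latency gap in point 1 concrete, here is a minimal timing sketch. The bucket name, object key, and https:// endpoint are placeholders, boto3 is assumed to be installed, and this is not how Spack itself issues these requests:

```python
import time
import urllib.request

import boto3  # assumed available; Spack goes through its own URL-handling layer

BUCKET = "spack-binaries"                             # placeholder bucket name
KEY = "develop/build_cache/index.json"                # placeholder object key
HTTPS_URL = "https://<bucket-https-endpoint>/" + KEY  # placeholder https:// equivalent


def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")


s3 = boto3.client("s3")
timed("s3:// head_object", lambda: s3.head_object(Bucket=BUCKET, Key=KEY))
timed("https:// HEAD request",
      lambda: urllib.request.urlopen(urllib.request.Request(HTTPS_URL, method="HEAD")))
```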


We already had a small optimization to reduce the number of requests:

  4. We check the local, offline cache of each mirror for the spec, and then prioritize mirrors with matching specs, which typically results in direct hits.

However, the optimization has a bug: if the spec cannot be located in any local cache (either because none of the remotes have the spec at all, or because we do not have a local cache for a mirror), Spack does a partial update of the cache. Partial in the sense that it queries every mirror for the spec by directly fetching the relevant spec.json files. In this case the optimization does strictly more harm than good: all mirrors are queried before a download even starts, whereas without the "optimization" Spack would simply stop at the first mirror it can download from.
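
For reference, a minimal sketch of what a fully offline version of this mirror ordering could look like (hypothetical helper names and data structures, not Spack's actual code):

```python
# Hypothetical sketch: order mirrors using only locally cached index data.
# `local_indices` maps a mirror URL to the set of dag hashes found in its
# cached index.json; a mirror is missing if we never fetched an index for it.
def order_mirrors(mirrors, local_indices, dag_hash):
    def known_hit(mirror):
        cached = local_indices.get(mirror)
        return cached is not None and dag_hash in cached

    # Stable sort: mirrors with a known hit come first, everything else keeps
    # its configured order. No network requests are made here.
    return sorted(mirrors, key=lambda m: not known_hit(m))
```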


  5. @blue42u changed the optimization in point 4 to go from a partial update to a full update of the cache when there is no local cache hit. The idea is that, for the next spec to install, the cache can finally be exploited and the optimization works: mirrors are ordered to get direct hits.

However, 5. only makes sense for terribly slow mirrors, since there is a high startup cost to fetching an index.json with, say, 100K specs. Slow mirrors are not the norm (most users have file:// mirrors or mirrors on a low-latency local network), so this PR makes the Spack experience worse for those users. For fast mirrors we really want to do direct fetches (and also use the fully offline mirror order optimization).

In fact, we never had any issues with the https://mirror.spack.io URLs for sources; it would be absurd if Spack first downloaded an index of all sources available on mirror.spack.io just to use it when installing packages from source.


What had not really been looked into is why these s3:// requests are so slow in the first place. It turns out the cause is a couple of trivial issues:

  6. Each s3:// 404 caused Spack to also try to download <failing url>/index.html; this was fixed in Stop checking for {s3://path}/index.html (#34325).
  7. Spack creates a new S3 client instance for each request, which itself requires one or more requests to S3 if no credentials are provided, causing huge overhead. For me, reusing the same client instance makes requests 4x faster; see s3: cache client instance (#34372).
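
As an illustration of point 7, a minimal sketch of client reuse using boto3 directly (Spack's actual change in #34372 lives in its own S3 helper code, so this is only the idea):

```python
import functools

import boto3  # assumed available; Spack wraps this in its own s3 utilities


@functools.lru_cache(maxsize=None)
def s3_client(endpoint_url=None, profile_name=None):
    """Build the S3 client once per (endpoint, profile) and reuse it.

    Creating a client can itself trigger credential/region lookups, so doing
    it on every request adds one or more extra round trips.
    """
    session = boto3.session.Session(profile_name=profile_name)
    return session.client("s3", endpoint_url=endpoint_url)
```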

Next, what has not yet been addressed:

  8. Direct existence checks on mirrors are slower than necessary, because in the case of a cache miss three requests are made (spec.yaml, spec.json, spec.json.sig); we can reduce this to a single request.
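
One hedged way to get down to a single request (the file layout, suffixes, and helper below are illustrative, not necessarily what Spack will do): list the keys under the spec's prefix once and pick the best available variant.

```python
# Illustrative sketch: one ListObjectsV2 call instead of up to three GETs.
def find_specfile(s3, bucket, specfile_prefix):
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=specfile_prefix, MaxKeys=10)
    keys = {obj["Key"] for obj in resp.get("Contents", [])}

    # Prefer the signed json, then plain json, then the legacy yaml.
    for suffix in (".spec.json.sig", ".spec.json", ".spec.yaml"):
        candidate = specfile_prefix + suffix
        if candidate in keys:
            return candidate
    return None  # still only one request to learn that it is a miss
```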

When 6-8 are all addressed, I expect the overhead to drop by at least a factor of 10, especially on the unhappy cache-miss path.


Going forward, I think the highest priority is to fix point 7.

Then we should ensure the mirror order optimization is always offline, which means partially reverting @blue42u's PR, and adding index_only=True in the relevant place where a spec is searched for.

To make @blue42u happy, it could be useful to have a command like spack mirror update (or something similar) that updates the local binary index when necessary, which he could then run before spack install in CI.


blue42u commented Dec 7, 2022

> However, 5. only makes sense for terribly slow mirrors, since there is a high startup cost to fetching an index.json with, say, 100K specs. Slow mirrors are not the norm (most users have file:// mirrors or mirrors on a low-latency local network), so this PR makes the Spack experience worse for those users. For fast mirrors we really want to do direct fetches (and also use the fully offline mirror order optimization).

Note that Spack end-users generally have a .spack/cache directory that is not wiped frequently (spack clean is only used when needed). In that case the startup cost is only paid once, and the cost of updating should be low, since only index.json.hash needs to be fetched. This was not strictly true due to a bug that will be fixed by either of the following PRs:

Note also that Spack end-users almost always concretize (more than) once in their normal operations. This already requires the startup cost to be paid to fully implement buildcache --reuse semantics.

For the above users, paying the startup cost and using the mirror order optimization provides an overall benefit, which can be significant with moderate-to-large specs, multiple mirrors, or high-latency "slow" mirrors.
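
For illustration, here is a minimal sketch of the cheap update path mentioned above: fetch the small index.json.hash first and only re-download index.json when the hash has changed. The URLs and cache layout are assumptions, and this is not Spack's actual bindist code.

```python
import os
import urllib.request


def refresh_cached_index(mirror_url, cache_dir):
    """Re-download index.json only if index.json.hash changed (layout assumed)."""
    remote_hash = urllib.request.urlopen(
        mirror_url + "/build_cache/index.json.hash").read().decode().strip()

    hash_path = os.path.join(cache_dir, "index.json.hash")
    if os.path.exists(hash_path):
        with open(hash_path) as f:
            if f.read().strip() == remote_hash:
                return False  # one tiny request, no index download needed

    index_bytes = urllib.request.urlopen(
        mirror_url + "/build_cache/index.json").read()
    with open(os.path.join(cache_dir, "index.json"), "wb") as f:
        f.write(index_bytes)
    with open(hash_path, "w") as f:
        f.write(remote_hash)
    return True  # index was stale (or missing) and has been refreshed
```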

The only "other Spack users" for whom paying this startup cost causes an overall slowdown are users who (a) don't have a persistent cache and (b) don't concretize during their operations. The known (and likely most common) case of this is Spack in CI, where clean-room containers are the norm. Specifically, the build jobs generated by spack ci generate fall into this case, which in turn affects Spack's own CI.

The great irony here is that I implemented #32137 to speed up our own use of Spack in CI. And we reaped benefits, because for us 12 requests * 50 packages was significantly more expensive than fetching the index.json once for each of 3 mirrors. This indicates to me that Spack's CI is different in some way that needs to be investigated.
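
For a rough sense of that trade-off, here is a back-of-the-envelope comparison. The per-request latency and index download time below are assumptions for illustration, not measurements:

```python
# Rough cost model for the numbers quoted above: 50 packages, 12 spec-file
# requests per package (across 3 mirrors), versus one index.json per mirror.
packages = 50
requests_per_package = 12
mirrors = 3

request_latency = 1.5     # seconds per s3:// request (assumed, see point 1)
index_download_time = 30  # seconds per large index.json (assumed)

direct_checks = packages * requests_per_package * request_latency
index_based = mirrors * index_download_time

print(f"direct existence checks: ~{direct_checks:.0f}s")  # ~900s
print(f"index-based lookups:     ~{index_based:.0f}s")     # ~90s
```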

However, since the worst-case scenario is indeed worse now that #32137 has landed, I have given my blessing to #34326 which reverts the behavior it adds that causes problems for Spack's CI.


More of what has not really been addressed:

  9. The index.json format used is very inefficient (see also the size-estimate sketch after this list). Using numbers calculated from the current 172MB index.json for the develop public buildcache (ref):

    • The format can be losslessly reduced by over 20x, just by removing whitespace ("minifying") and compressing with gzip.
    • The optimization referenced here does not require full Spec objects to implement. If only the top-level Spec information is extracted from index.json, the resulting JSON is 4x smaller. Minifying and compressing this "contents.json" results in a file more than 78x smaller than the current format.
  10. Algorithmically, a whole spack ci pipeline before Update the binary index before attempting direct fetches (#32137) took O(#mirrors * #package-nodes^2) requests in total, none of which were for index.json. It currently takes O(#mirrors * #package-nodes + #package-nodes^2) requests, many of which are for index.json.

    One could imagine reducing this to O(#package-nodes^2) by planning ahead of time in spack ci generate which mirror each package will be fetched from. This would be much faster than the pre-#32137 case when request latency dominates over network bandwidth.

  11. spack ci pipelines could have a (semi-)persistent index cache by using the cache: clause provided by GitLab. This could be implemented once the following issues are resolved:
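
To reproduce the kind of size estimate given in point 9, a small sketch that assumes a locally downloaded copy of index.json (the "contents.json" extraction step is only described above, not implemented here):

```python
import gzip
import json
import os

# Measure how much a local copy of index.json shrinks when minified and gzipped.
path = "index.json"  # assumed: a locally downloaded copy of the buildcache index
with open(path) as f:
    index = json.load(f)

minified = json.dumps(index, separators=(",", ":")).encode()
compressed = gzip.compress(minified)

print("original bytes: ", os.path.getsize(path))
print("minified bytes: ", len(minified))
print("minified+gzip:  ", len(compressed))
```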

scottwittenburg commented:

> To make @blue42u happy, it could be useful to have a command like spack mirror update (or something similar) that updates the local binary index when necessary, which he could then run before spack install in CI.

I was thinking something like this too. What (I think) it will also require is some way to get Spack to run bindist.update() in the process that is doing spack install, or else Spack won't ingest its locally cached indices. At the moment, calling bindist.update() results in attempts to fetch index.json.hash for any cached indices, but I guess that shouldn't be a show-stopper.


vsoch commented Jan 7, 2023

To add a note here: it used to be possible to add a filesystem mirror without the file:// prefix. I just discovered that this no longer works and I have to add it explicitly.
