Update the binary index before attempting direct fetches #32137
Conversation
Force-pushed from b6e93e0 to 8acdc0b
Force-pushed from bab9286 to d6e740c
This looks like it would trigger a large number of attempts at updating the cache (especially if many packages in a DAG aren't kept there), which could itself be expensive. Presumably if you've checked "recently" (e.g. in the last 10 minutes) then there shouldn't be a need to re-check. Do you think it would be reasonable to add a mechanism to limit this?
Good point, it's only the cost of fetching the

I think a rate-limiting mechanism is reasonable, but I'm a little concerned a time-based mechanism would run into race conditions in tight CI pipelines. Would a configurable time-based limit (e.g. one fetch per 10 min) plus a limit of one fetch per

I should have time to implement this over the weekend.
Ah that's true, I should definitely think more on that before asking: it's not much data (I forgot there's a hash for the index and thought it would have to constantly re-retrieve it) and it's just one request per mirror per package install (which it would be doing anyway if the package were there). I'll think on it for a day.
Follow up: I think there should be a timer on retries, but it does not have to persist between Spack invocations (i.e. it can be a property of

However, looking through
👍 but why have the timer then? I would think any reasonable timer would be far longer than most Spack commands' use of the
I'm the wrong person to ask for an opinion, I'm just hacking on the code to make it do what I want. 😅 But now that you made me look... I'm not sure where

IMHO, if you want "conceptual consistency," I think the functions in

So I guess my opinion would be that embedding the refresh in a getter method makes the most sense for this PR. (It's also a very common pattern for lazily-filled proxy objects,
Force-pushed from d6e740c to a719c9d
Updated with a configurable TTL for the in-memory cache of binary indices, default 10 minutes.
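The TTL-based cooldown described here can be sketched roughly as follows. This is an illustrative sketch, not Spack's actual implementation: the names (`CooldownTracker`, `should_fetch`) and the injectable clock are assumptions made for the example.

```python
import time


class CooldownTracker:
    """Per-mirror rate limit: skip a re-fetch if the last attempt was recent."""

    def __init__(self, ttl=600.0, clock=time.monotonic):
        self.ttl = ttl                # cooldown window in seconds (default 10 min)
        self._clock = clock           # injectable for testing
        self._last_fetch_times = {}   # mirror url -> timestamp of last attempt

    def should_fetch(self, mirror_url):
        """Return True (and record the attempt) iff the cooldown has expired."""
        now = self._clock()
        last = self._last_fetch_times.get(mirror_url)
        if last is not None and now - last < self.ttl:
            return False
        self._last_fetch_times[mirror_url] = now
        return True
```

Because the tracker lives in memory, the cooldown naturally resets between Spack invocations, matching the earlier point that the timer need not persist across commands.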
When running
@scheibelp I agree, but let me confirm the exact behavior you're looking for. I am thinking of two possibilities that fit so far:
Both cases give the freshness guarantee that every
I just want to chime in on the original intent with binary index caching vs direct fetching, since I had something to do with it.

In spack, there's no requirement that a binary index is up to date, or even exists. Or at least that was true when I was implementing the first iteration of the binary cache index. Also, for a spack installation with a single mirror, it was really fast to install a particular spec from binary by using the direct fetch approach, which I wanted to preserve. Requiring fetching/incorporating the index first seemed to slow that down quite a lot at the time.

I think assuming an up to date index is getting more common, and maybe it will eventually be a requirement/invariant, but I think that's some ways off in the future. If that were to happen though, I could imagine dispensing with the "direct fetching" altogether.

But at the moment, in our gitlab ci pipelines (where each job may install a lot of binary dependencies), we don't actually update the index at the end of the job, due to the high likelihood of many parallel jobs attempting to update it at the same time. So in that situation, fetching the index at the start of every install is unlikely to prevent the need to fall back to the direct fetch approach, and fetching the index will just slow the pipeline down more.

It's possible I'm missing something or have forgotten a key detail or two, please feel free to point that out. Also, I'm intentionally not approving or requesting changes at this point, just trying to involve myself in the conversation.

I do want to say thanks to @blue42u though, it's great to have someone actively thinking about and contributing to the pipeline effort! 🚀 🙏 Please keep stirring up the waters!
I've seen that slowdown myself too with just

When I looked into it (weeks ago now) it seemed that most of this slowdown was related to "lifting" the entire JSON to
Force-pushed from a719c9d to 808c3fa
Force-pushed from 808c3fa to 53da4e6
I have one request for schema documentation and one request to make the update cooldown behavior optional. Let me know if either is confusing or seems wrong.
Co-authored-by: Tamara Dahlgren <35777542+tldahlgren@users.noreply.github.com>
To allow for index re-fetches in sufficiently long-running Spack commands
Force-pushed from 53da4e6 to b2f0f37
I just noticed when trying to reproduce a pipeline generation job with
I wonder if it's because the default value of the

Partial log showing the behavior attached below.
@scottwittenburg is the CI invoking separate

With this PR, you should not see more messages of that form (especially many more) unless running

As it is, this PR:

If in this context a single
Well, it's not
Looking back over this, I think the cooldown isn't activating in the case of a failed fetch, which causes errors like when

I'll whip up a fix quick, one moment...
So am I wrong that this call site needs to be updated to pass
I think so, as the first call to

But please feel free to test what I think is the fix: #33509
#32137 added an option to update() a BinaryCacheIndex with a cooldown: repeated attempts within this cooldown would not actually retry. However, the cooldown was not properly tracked for failures (which is common when the mirror does not store any binaries and therefore has no index.json). This commit ensures that update(..., with_cooldown=True) will also skip the update even if a failure has occurred within the cooldown period.
The fact that S3 has slow response time should not mean that we should always download a full index. That's primarily an AWS issue.

The index is 80MB of json right now, and it lets you fetch zlib which is a few KB :/

If you have a fast HTTP cache server, and the connection stays alive, you can get response times of a few milliseconds.

On top of that this PR partly causes a rather annoying issue we see in our CI: if index.json.hash is out of sync/unavailable, this causes Spack install to download it every time it's called, because either it doesn't get cached, or the remote content hash is different. Combine that with parallel install and you get tons of redundant re-indexing.

A lot of time goes into building a database out of a fetched index.json (even on a slower download, constructing the index in memory takes most of the time).
The

```json
{
  "23cpnmxeoa44edrzxlwm6pjz6ibo6xg7": {
    "name": "gnuconfig",
    "version": "2021-08-14",
    "arch": {
      "platform": "linux",
      "platform_os": "amzn2",
      "target": "aarch64"
    },
    "compiler": {
      "name": "gcc",
      "version": "7.3.1"
    }
  },
  ...
}
```

One could also imagine "minifying" either JSON file to remove unneeded whitespace, or gzip-compressing it for fast de/compression via the Python standard library. Some quick preliminary results:
Also, note that

In conclusion, the fact that Spack has to fetch the full index is primarily because Spack is dumb and always loads a full index, not because AWS is slow responding with 404. Which wouldn't be an issue at all if Spack didn't cause 404s like they're going out of style (and yes, AFAICT it still does). #BlameSpack
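The minify/gzip idea from the comment above can be tried with just the standard library. A rough sketch, run here on a tiny stand-in record (copied from the sample earlier in the thread) rather than the real 80MB index.json, so the absolute sizes are not representative:

```python
import gzip
import json

# One database record, shaped like the index.json excerpt above.
record = {
    "23cpnmxeoa44edrzxlwm6pjz6ibo6xg7": {
        "name": "gnuconfig",
        "version": "2021-08-14",
        "arch": {"platform": "linux", "platform_os": "amzn2", "target": "aarch64"},
        "compiler": {"name": "gcc", "version": "7.3.1"},
    }
}

pretty = json.dumps(record, indent=2).encode()
# separators=(",", ":") drops all inter-token whitespace ("minifying").
minified = json.dumps(record, separators=(",", ":")).encode()
# gzip via the stdlib; level 6 is the usual speed/size trade-off.
compressed = gzip.compress(minified, compresslevel=6)

print(len(pretty), len(minified), len(compressed))
```

On the real index, where records repeat the same keys millions of times, gzip would be expected to do far better than it does on this one-record toy input.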
I would be very surprised if that was not already done automatically.
It is slow to download due to throttling, going well below 1MB/s at times.
Spack constructs a

Still I don't see the point of downloading a full index of
Let me just reiterate this point:
Surprise, it's not. (Verified with Wireshark.)
Then I ask, why is Spack constructing a full
Before this PR

It is true that the bytes downloaded have increased from

To achieve the
Given the Spack public buildcache is hosted on AWS, I would think this is indeed the most important case to the Spack team.

I admit this PR is incomplete (for many reasons), but it does provide significant benefits to us and I'd much rather implement an argument to
Sigh. Of course it doesn't :(
O(m) is fine. Another solution is to kick off multiple requests and let the first 200 response win, which is O(m/p). And as a side effect, the fastest response is typically from the mirror with the best download speed.
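The "first 200 response wins" idea could be sketched with a thread pool along these lines. This is a hypothetical illustration, not Spack code; `probe` stands in for a real HTTP HEAD/GET against one mirror, and a raised exception stands in for a 404 or connection error:

```python
import concurrent.futures


def first_success(mirrors, probe, timeout=30.0):
    """Probe all mirrors in parallel; return (mirror, response) of the first
    success, or (None, None) if every probe fails."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(mirrors)) as ex:
        futures = {ex.submit(probe, m): m for m in mirrors}
        for fut in concurrent.futures.as_completed(futures, timeout=timeout):
            try:
                return futures[fut], fut.result()
            except Exception:
                continue  # this mirror 404'd or errored; wait on the rest
    return None, None
```

As the comment notes, this also biases toward the mirror with the lowest latency, since its response tends to complete first.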
Sure, but the solution is to fix the AWS related issue. I'm repeating myself: downloading the full index is not the solution, it's not scalable, it should not be standard behavior of Spack. Whether you manage to speed up other slow parts of Spack or not, this is not how to do it.
```diff
@@ -253,6 +257,9 @@ def find_by_hash(self, find_hash, mirrors_to_check=None):
         mirrors_to_check: Optional mapping containing mirrors to check. If
             None, just assumes all configured mirrors.
         """
+        if find_hash not in self._mirrors_for_spec:
+            # Not found in the cached index, pull the latest from the server.
+            self.update(with_cooldown=True)
```
This bothers me too about this PR. Please do not make read operations do write operations.
Caches as a general rule do a write operation on cache miss. That's what's happening here. This should be normal and I was very surprised to find it not doing something like this before.
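For illustration, the read-through-on-miss pattern being debated looks roughly like this. The names mirror the diff above but the class is hypothetical, not Spack's real `BinaryCacheIndex`; `fetch` stands in for the remote index download:

```python
import time


class ReadThroughIndex:
    """A cache that refreshes itself (cooldown-limited) on a lookup miss."""

    def __init__(self, fetch, cooldown=600.0, clock=time.monotonic):
        self._fetch = fetch          # callable returning {hash: entry}
        self._cooldown = cooldown
        self._clock = clock
        self._entries = {}
        self._last_update = None

    def update(self, with_cooldown=False):
        now = self._clock()
        if (with_cooldown and self._last_update is not None
                and now - self._last_update < self._cooldown):
            return               # attempted recently; skip the refresh
        self._last_update = now
        self._entries = self._fetch()

    def find_by_hash(self, h):
        if h not in self._entries:
            # Cache miss: pull the latest index, then look again.
            self.update(with_cooldown=True)
        return self._entries.get(h)
```

The cooldown is what keeps the "read triggers a write" objection bounded: a DAG full of misses costs at most one refresh per cooldown window, not one per package.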
What you wrote is more wrong than the previous implementation was wrong. The point is to do a local existence check to order the mirrors. That should never trigger a remote fetch on failure.
```diff
@@ -105,6 +106,9 @@ def __init__(self, cache_root):
         # cache (_mirrors_for_spec)
         self._specs_already_associated = set()
 
+        # mapping from mirror urls to the time.time() of the last index fetch.
+        self._last_fetch_times = {}
```
And as far as I can see: this does not help across two `spack install` calls :(
There was concern for sequences of operations like:

```
$ spack install ...
$ spack buildcache update-index ...
$ spack install ...
```

If the fetch timers spanned `spack install` commands, the latter `spack install` may (surprisingly) not see the effect of the `spack buildcache update-index`. Same logic as to why `update()` has a `with_cooldown=False` argument.

Note that `index.json.hash` gets fetched first, before `index.json` (or at least, it's supposed to). So if the `index.json` didn't change between calls it won't be fetched a second time.
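The hash-first protocol described above can be sketched as follows. Assumptions: the hash file contains a SHA-256 digest of `index.json` (hedged; the exact digest Spack uses should be checked against the source), and `fetch_text` is a stand-in for a real HTTP fetch:

```python
import hashlib


def update_index(fetch_text, cached_hash, cached_index):
    """Fetch the small index.json.hash first; only fetch the full index.json
    when the remote hash differs from what we have cached."""
    remote_hash = fetch_text("index.json.hash").strip()
    if remote_hash == cached_hash:
        # Nothing changed: one tiny request, no index download.
        return cached_hash, cached_index
    index = fetch_text("index.json")
    # Sanity-check that the payload matches the advertised hash.
    assert hashlib.sha256(index.encode()).hexdigest() == remote_hash
    return remote_hash, index
```

This is why repeated `spack install` calls against an unchanged mirror should only cost a hash fetch each, not a full index download.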
We otherwise invalidate the cache by mtime; I was thinking it could make sense not even to make a request if (now - mtime) is small enough.
@blue42u can you please just revert this PR, I see many issues with it. As a workaround, can you go for this idea instead?
@haampie FYI, I do not have write access. I'm an external contributor and part of the HPCToolkit team. This PR is one of a batch I wrote to get our CI working. I would prefer not to have this PR reverted, without it

What I will do is start writing a followup PR that:

All those together will cost
Before you do all that, did you try

Edit: I quickly tried the following example:

```python
import spack.util.s3
import time

def slow(key):
    try:
        t = time.time()
        s = spack.util.s3.create_s3_session("s3://spack-binaries")
        s.head_object(Bucket="spack-binaries", Key=key)
    finally:
        print(time.time() - t)

session = spack.util.s3.create_s3_session("s3://spack-binaries")

def faster(key):
    try:
        t = time.time()
        session.head_object(Bucket="spack-binaries", Key=key)
    finally:
        print(time.time() - t)
```

Would probably be better to ensure we don't instantiate a new client each time, and look into these kinds of things in general, before even considering PRs like these.
Now that I have, I get timings of 100-450ms, with later attempts mostly on the lower end of that range. Unfortunately AFAIK we can't use an
I tried a one-line version of your example, on my development machine (wired connection):

```
$ ./bin/spack python -m timeit -v -s 'import spack.util.s3 as s3' 'try: s3.create_s3_session("s3://spack-binaries").head_object(Bucket="spack-binaries", Key="does/not/exist")' 'except Exception: pass'
$ ./bin/spack python -m timeit -v -s 'import spack.util.s3 as s3' -s 's = s3.create_s3_session("s3://spack-binaries")' 'try: s.head_object(Bucket="spack-binaries", Key="does/not/exist")' 'except Exception: pass'
```

If boto3 authenticates (i.e. I have a key in

(I imagine the difference in timings between

I also tried using

So, reassessment of the situation:

The reason this PR helps is because it reduces the number of calls to

Note also that

This PR hurts @haampie's case because it causes more calls to

So, suggested changes (in order of roughly increasing difficulty):
@haampie I think this sequence would be enough to fully solve your scenario, do you see any issue in what I've suggested?
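A minimal sketch of the "don't instantiate a new client each time" suggestion: memoize the session per mirror URL so repeated `head_object` calls reuse one client (and, with it, the underlying HTTP connection). Here `make_session` is a stand-in for `spack.util.s3.create_s3_session`:

```python
# Module-level cache: one session per mirror URL.
_sessions = {}


def get_session(url, make_session):
    """Return a cached session for this URL, creating it on first use."""
    if url not in _sessions:
        _sessions[url] = make_session(url)
    return _sessions[url]
```

This matches the earlier timeit observation: client construction dominated the "slow" variant, while reusing one session brought per-request times down substantially.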
Makes sense.
PR spack#32137 was based on the wrong assumption that direct fetches from mirrors are necessarily slow. The solution (fetch the index to locally see if something is in it) is not scalable. Instead, there's plenty of space to improve the current s3:// fetch logic: it shouldn't create a client every time, the number of 404s can be halved by not trying to fetch s3://[url]/index.html if s3://[url] responds with 404. Etc.
Connecting dots since @haampie is at it over here:
My own comments:
`spack install` will not update the binary index if given a concrete spec, which causes it to fall back to direct fetches when a simple index update would have helped. For S3 buckets in particular, this significantly and needlessly slows down the install process.

This commit alters the logic so that the binary index is updated whenever a by-hash lookup fails. The lookup is attempted again with the updated index before falling back to direct fetches.
With 3 nonexistent mirrors + the Spack public binary cache, before the commit:
After this commit: