
Enable cooldown for failed index.json fetches #33509

Merged (1 commit, Oct 25, 2022)

Conversation

@blue42u (Contributor) commented Oct 25, 2022

My previous PR (#32137) added a cooldown between fetches of a buildcache's index.json, but only applied it when the fetch was successful. In some cases (#32137 (comment)) this leads to long runs of repeated fetch failures.

Enable the cooldown for failed fetches as well, and propagate the success/failure status when a fetch is prevented by the cooldown.
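For context, a minimal sketch of the intended behavior; this is not Spack's actual implementation, and the class shape, the `COOLDOWN_SECONDS` value, and the injected `fetch_index` callable are all made up for illustration:

```python
import time

COOLDOWN_SECONDS = 600  # hypothetical stand-in for Spack's cooldown interval


class BinaryCacheIndexSketch:
    """Illustrative only: tracks (timestamp, succeeded) per mirror URL."""

    def __init__(self, fetch_index):
        self._fetch_index = fetch_index  # callable(url); raises on failure
        self._last_fetch_times = {}  # url -> (last attempt time, succeeded?)

    def update(self, url, with_cooldown=True):
        now = time.time()
        if with_cooldown and url in self._last_fetch_times:
            last_time, succeeded = self._last_fetch_times[url]
            if now - last_time < COOLDOWN_SECONDS:
                # Within the cooldown the fetch is skipped, for failed
                # attempts too, and the previous attempt's outcome is
                # reported instead of silently pretending it succeeded.
                return succeeded
        try:
            self._fetch_index(url)
            self._last_fetch_times[url] = (now, True)
            return True
        except OSError:
            # Record failed attempts as well, so they cool down too.
            self._last_fetch_times[url] = (now, False)
            return False
```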

@spackbot-app bot added labels: binary-packages, core (PR affects Spack core functionality) on Oct 25, 2022
@scheibelp self-assigned this Oct 25, 2022
@blue42u force-pushed the pr-index-cooldown-on-failure branch from 5962373 to 38621df on October 25, 2022 17:38
The review discussion below was attached to this hunk in lib/spack/spack/binary_distribution.py:

```diff
 else:
     # May need to fetch the index and update the local caches
     try:
         needs_regen = self._fetch_and_cache_index(
             cached_mirror_url, expect_hash=cached_index_hash
         )
-        self._last_fetch_times[cached_mirror_url] = now
+        self._last_fetch_times[cached_mirror_url] = (now, True)
         all_methods_failed = False
```
Review comment from a maintainer (Member):

@haampie I have a question that's only tangentially related to this PR: it appears that if any cached_mirror_url succeeds, then all_methods_failed will be False for all of the URLs, so this variable specifically tracks whether we could update at least one mirror index successfully. Is that what you wanted explicitly (from examining the git blame, it looks like you added that logic; let me know if I got that wrong)?
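To make the question concrete, here is a sketch of the control flow being described; the function name, the mirror list, and the use of RuntimeError in place of Spack's FetchCacheError are stand-ins:

```python
def update_all_mirrors(mirror_urls, fetch_and_cache_index):
    """Sketch: all_methods_failed only records whether at least one
    mirror index could be updated; a single success clears it for good."""
    all_methods_failed = True
    for url in mirror_urls:
        try:
            fetch_and_cache_index(url)
            all_methods_failed = False  # any one success flips this forever
        except OSError:
            continue  # a per-mirror failure does not reset the flag
    if all_methods_failed:
        # Stands in for Spack raising FetchCacheError when nothing worked.
        raise RuntimeError("could not update any mirror index")
```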

Reply from @haampie (Member):

I don't know what it has turned into, but the idea is to have an exception propagate out of this if nothing worked. Before my PR it was a sad debug message, IIRC, or no error at all.

(Review thread on lib/spack/spack/binary_distribution.py marked resolved.)
@scheibelp (Member) commented:

@spackbot run pipeline

@spackbot-app bot commented Oct 25, 2022:

I've started that pipeline for you!

@scheibelp merged commit 9d5151b into spack:develop Oct 25, 2022
@scottwittenburg (Contributor) commented:

While debugging some pipeline generation jobs from the cloud locally, I just pulled this commit in, but it didn't prevent the hundreds of those messages (example below). I'll run with the debugger and see if I can figure out why not.

```
==> [2022-10-25-18:20:32.321047] Checking existence of https://mirror.spack.io/build_cache/index.json
==> [2022-10-25-18:20:32.466065] Failure reading URL: Download failed: HTTP Error 404: Not Found
```

@scottwittenburg (Contributor) commented:

Oops, too late.

@scheibelp (Member) commented:

Does merging this make it harder to debug? I can push a PR that reverts it. I figured it would be easier to investigate with it merged (normally I wouldn't handle it like that, but I wasn't sure how easy it would be to debug the CI behavior from a PR).

@scottwittenburg (Contributor) commented:

In that case, no worries. You're right, it will be easier not to need to pull this in separately.

becker33 pushed a commit to RikkiButler20/spack that referenced this pull request on Nov 2, 2022:

> spack#32137 added an option to update() a BinaryCacheIndex with a cooldown: repeated attempts within this cooldown would not actually retry. However, the cooldown was not properly tracked for failures (which is common when the mirror does not store any binaries and therefore has no index.json).
>
> This commit ensures that update(..., with_cooldown=True) will also skip the update even if a failure has occurred within the cooldown period.
@scottwittenburg (Contributor) commented:

I did some debugging to get to the bottom of the repeated, failed attempts to fetch the index during pipeline generation. The problem happens if the configuration contains any mirrors without an index. The _fetch_and_cache_index function detects that condition and simply returns False, but given the new behavior, I'm guessing it should maybe also add an entry to self._last_fetch_times?
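A hedged sketch of what that suggestion would look like; the signature, the injected dict, and the `index_exists` probe are hypothetical, not Spack's actual code:

```python
import time


def fetch_and_cache_index(url, last_fetch_times, index_exists):
    """Sketch of the suggestion: even when the mirror has no index.json,
    leave a record so the caller's cooldown check applies next time."""
    now = time.time()
    if not index_exists(url):  # hypothetical probe for index.json
        # Without this, the early return leaves no entry behind, so every
        # subsequent update() retries the fetch immediately.
        last_fetch_times[url] = (now, True)  # a call that returns counts as "success"
        return False
    # ... the real fetch-and-cache logic would go here ...
    return True
```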

@blue42u (Contributor, author) commented Nov 7, 2022:

@scottwittenburg If _fetch_and_cache_index returns False, it should already add an entry to self._last_fetch_times:

```python
needs_regen = self._fetch_and_cache_index(
    cached_mirror_url, expect_hash=cached_index_hash
)
self._last_fetch_times[cached_mirror_url] = (now, True)
all_methods_failed = False
```

Can you check whether it's actually calling _fetch_and_cache_index after the first call (i.e. is this conditional always false)? I'm wondering if the fetch failed in the first call, and after that it's constantly re-raising the FetchCacheError on every call to update() from find_by_hash().

@scottwittenburg (Contributor) commented:

This is the call site I'm hitting in local testing. When that just returns False because there is no index, I guess it does put an entry into _last_fetch_times (entering True for the success/failure indicator; is that ok?). And that's the line that triggers the index fetch every time. If False is returned here, then the mirror URL never gets put into self._local_index_cache, and hence it never gets into the loop with the conditional you linked above.

@blue42u (Contributor, author) commented Nov 7, 2022:

I see now. So, in short, this entire code block needs to be put inside a conditional just like that one. That would enable the cooldown for "new" mirror URLs, which covers the case where a mirror has no index and is therefore always considered "new."

> entering True for the success/failure indicator, is that ok?

That is how the logic worked before my PRs: a _fetch_and_cache_index call that returns (rather than raises) is considered "successful" and prevents throwing a FetchCacheError. IMHO the logic doesn't make much sense, but I'm not the author here.
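A rough sketch of that follow-up idea (the real diff is in #33781; the names, the injected dict, and the cooldown value here are hypothetical): the cooldown check moves ahead of all per-mirror work, so even mirror URLs that are always considered "new" get skipped:

```python
import time

COOLDOWN_SECONDS = 600  # hypothetical


def update_one_mirror(url, last_fetch_times, fetch_and_cache, with_cooldown=True):
    """Sketch: check the cooldown before any per-mirror work, covering
    mirrors without an index.json that never enter the cached-URL loop."""
    now = time.time()
    previous = last_fetch_times.get(url)
    if with_cooldown and previous and now - previous[0] < COOLDOWN_SECONDS:
        return previous[1]  # skip the whole block; report the last outcome
    try:
        succeeded = fetch_and_cache(url)  # stand-in for the per-mirror block
    except OSError:
        succeeded = False
    last_fetch_times[url] = (now, succeeded)
    return succeeded
```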

@blue42u deleted the pr-index-cooldown-on-failure branch November 9, 2022 14:55
@blue42u (Contributor, author) commented Nov 9, 2022:

Implementation of the above idea is over in #33781.

charmoniumQ pushed a commit to charmoniumQ/spack that referenced this pull request on Nov 19, 2022.