fix(#190): fail image-refresh loudly when the ingestor (ghcr) digest can't resolve #191
Open
saadqbal wants to merge 1 commit into
…can't resolve

image-refresh silently skipped every tick when get_latest_digest returned empty for the ghcr.io ingestor image (egress/proxy/firewall to ghcr.io, or a blocked token endpoint), never reaching the registry-drift branch that sets the new digest. jobs-manager + pods-monitor pull from docker.io and refreshed fine, so the CronJob looked healthy while the ingestor digest stayed pinned on the install-time baseline. That's why the berlin-team arm64 install sat on the amd64-only v0.3.1 digest even after :0.3 went multi-arch (#186 follow-up #2).

Now count consecutive ingestor-resolve failures on a deployment annotation:

- below imageRefresh.ingestorResolveFailureThreshold (default 3, ~45 min at the 15-min schedule) -> WARN + skip, as before (tolerate transient blips);
- at/above it -> ERROR with actionable guidance, a tracebloc.io/ingestor-refresh-last-error annotation, and a non-zero exit so the Job fails visibly in `kubectl get cronjob` / monitoring, the same surfacing idiom Pass 2's stuck-rollout check already relies on;
- a successful resolve clears the streak.

Threshold is nil-guarded (default 3) for --reuse-values upgrades and schema-validated (integer >= 1). The digest-resolution logic itself is unchanged (verified correct: it returns the multi-arch index digest).

helm unittest 146/146, helm lint clean, shellcheck + sh -n clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
👋 Heads-up: Code review queue is at 10 / 8, above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
This was referenced Jun 3, 2026
Closes #190. Image-refresh hardening — follow-up #2 of #186 (the digest pin #185 and CI guard #187 covered the chart side).
Why

On the `berlin-team` arm64 install, jobs-manager stayed on the amd64-only ingestor baseline (`sha256:a0861ea9…`) even though the image-refresh CronJob ran and `:0.3` had become a multi-arch index (`sha256:d361fa77…`). Root cause: `get_latest_digest(… ghcr.io)` returned empty in-cluster (egress/proxy/firewall to `ghcr.io`, or a blocked token endpoint; on-prem installs often allowlist `docker.io` but not `ghcr.io`), so the ingestor block took its empty-digest skip path and silently skipped every tick, never reaching the registry-drift branch. jobs-manager + pods-monitor pull from `docker.io` and refreshed fine, so the CronJob looked healthy and nothing surfaced the ghcr failure. A persistent failure was indistinguishable from a transient one.

What

Count consecutive ingestor-resolve failures on a `tracebloc.io/ingestor-refresh-consecutive-failures` annotation:

- Below `imageRefresh.ingestorResolveFailureThreshold` (default 3, ≈45 min at the 15-min schedule) → `WARN` + skip, as today (tolerate transient blips).
- At or above it → `ERROR` with actionable guidance (check `ghcr.io` egress / token endpoint), a `tracebloc.io/ingestor-refresh-last-error` annotation, and a non-zero exit so the Job fails visibly in `kubectl get cronjob` / monitoring. This is the same "failed Job = operator-visible" idiom Pass 2's stuck-rollout check already uses.
- A successful resolve clears the streak.

Threshold is nil-guarded (`| default 3`) for `--reuse-values` upgrades and schema-validated (integer ≥ 1). The digest-resolution logic is unchanged.

Changed

- `templates/image-refresh-cronjob.yaml`: failure-counting + escalation; new `INGESTOR_RESOLVE_FAIL_THRESHOLD` env (nil-guarded) + in-script `:=3` default.
- `values.yaml` / `values.schema.json`: `imageRefresh.ingestorResolveFailureThreshold` (default 3, min 1).
- `tests/image_refresh_test.yaml`: default / `--reuse-values` fallback / override / schema-min tests.

Verified

`helm unittest` 146/146 (4 new); `helm lint` clean; `shellcheck -s sh` + `sh -n` clean on the rendered script.

Notes

- `Chart.yaml` bump: develop accumulates at 1.4.4 (matching #187, "ci(helm): guard that the pinned ingestor digest is multi-arch (closes #186)"); release-sync handles versioning.

🤖 Generated with Claude Code
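The escalation rule described under What can be sketched as a small, side-effect-free shell function. This is an illustration only, not the script in `templates/image-refresh-cronjob.yaml`: the function name `decide_ingestor_action`, its argument order, and its `<action> <streak>` output format are hypothetical, and it assumes the streak is incremented before it is compared with the threshold (which matches "default 3, ≈45 min at the 15-min schedule").

```sh
#!/bin/sh
# Hypothetical sketch of the WARN-vs-ERROR decision; not the chart's script.
# Usage: decide_ingestor_action DIGEST STREAK THRESHOLD
# Prints "<action> <new_streak>", where action is ok, warn, or error.
decide_ingestor_action() {
  digest=$1      # result of the digest lookup; empty means the resolve failed
  streak=$2      # consecutive failures recorded so far
  threshold=$3   # e.g. imageRefresh.ingestorResolveFailureThreshold

  if [ -n "$digest" ]; then
    # A successful resolve clears the streak.
    echo "ok 0"
    return 0
  fi

  # Assumption: count this failure first, then compare with the threshold.
  streak=$((streak + 1))
  if [ "$streak" -lt "$threshold" ]; then
    echo "warn $streak"    # below threshold: tolerate a transient blip
  else
    echo "error $streak"   # at/above threshold: the Job should exit non-zero
  fi
}

decide_ingestor_action "" 0 3            # prints: warn 1
decide_ingestor_action "" 2 3            # prints: error 3
decide_ingestor_action "sha256:abc" 2 3  # prints: ok 0
```

In the real job the streak would be read from and written back to the `tracebloc.io/ingestor-refresh-consecutive-failures` annotation, and the `error` case would also set `tracebloc.io/ingestor-refresh-last-error` before exiting non-zero; those steps need a cluster and are left out here.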
Note
Medium Risk
Changes failure semantics for the ingestor-only ghcr resolve path (CronJob can fail and patch deployment annotations), but successful refresh and digest logic are unchanged; mis-tuned thresholds could add noise or delay escalation.
Overview
When ghcr.io ingestor digest lookup fails, image-refresh no longer warns and skips every tick forever. It tracks consecutive failures on the jobs-manager deployment (`tracebloc.io/ingestor-refresh-consecutive-failures`), still tolerates transient blips below a threshold, then fails the CronJob with actionable logs and `tracebloc.io/ingestor-refresh-last-error` so operators see stuck ingestor auto-refresh (e.g. docker.io works but ghcr.io is blocked). A successful resolve clears those annotations.

`imageRefresh.ingestorResolveFailureThreshold` (default 3, schema min 1) is wired through `INGESTOR_RESOLVE_FAIL_THRESHOLD` on the refresh pod, with template and in-script defaults for `--reuse-values` / hand-edited CronJobs. docker.io class-1 images are unchanged (still warn-and-skip). Helm unittest covers default, null fallback, override, and schema rejection.

Reviewed by Cursor Bugbot for commit 52beeb0. Bugbot is set up for automated code reviews on this repo.
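The in-script fallback mentioned above relies on the POSIX `${VAR:=default}` expansion, which assigns the default when the variable is unset or empty. A minimal sketch of that behaviour follows; the wrapper function `effective_threshold` is hypothetical and exists only to make the fallback easy to exercise, and the 15-minute figure is the schedule quoted in this PR.

```sh
#!/bin/sh
# Hypothetical helper: returns the threshold the script would end up using
# for a given raw environment value (which may be empty).
effective_threshold() {
  INGESTOR_RESOLVE_FAIL_THRESHOLD=$1
  # ':=' assigns 3 when the variable is unset OR empty, e.g. a hand-edited
  # CronJob that lost the env entry.
  : "${INGESTOR_RESOLVE_FAIL_THRESHOLD:=3}"
  echo "$INGESTOR_RESOLVE_FAIL_THRESHOLD"
}

effective_threshold ""   # prints: 3
effective_threshold 5    # prints: 5

# Approximate time to escalation at the 15-minute schedule.
echo "$(( $(effective_threshold "") * 15 )) min"   # prints: 45 min
```

Keeping a default in the script as well as in the template means a missing env value degrades to the documented default rather than to an empty comparison operand.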