fix(#190): fail image-refresh loudly when the ingestor (ghcr) digest can't resolve #191

Open
saadqbal wants to merge 1 commit into develop from fix/image-refresh-surface-ghcr-failures

@saadqbal saadqbal commented Jun 3, 2026

Closes #190. Image-refresh hardening — follow-up #2 of #186 (the digest pin #185 and CI guard #187 covered the chart side).

Why

On the berlin-team arm64 install, jobs-manager stayed on the amd64-only ingestor baseline (sha256:a0861ea9…) even though the image-refresh CronJob ran and :0.3 had become a multi-arch index (sha256:d361fa77…). Root cause: get_latest_digest(… ghcr.io) returned empty in-cluster (egress/proxy/firewall to ghcr.io, or a blocked token endpoint — on-prem installs often allowlist docker.io but not ghcr.io), so the ingestor block hit:

if [ -z "$latest_ingestor" ]; then
  log "  WARN: could not resolve latest digest (rate-limited or transient); skipping this tick"

and silently skipped every tick, never reaching the registry-drift branch. jobs-manager + pods-monitor pull from docker.io and refreshed fine, so the CronJob looked healthy and nothing surfaced the ghcr failure. A persistent failure was indistinguishable from a transient one.

Ruled out while investigating: auto-upgrade does not reset the env (skips when already-latest; --reset-then-reuse-values, no --force); the resolver parses multi-arch indexes correctly (verified live → d361fa77); chart 1.4.3 already includes ingestor refresh.

What

Count consecutive ingestor-resolve failures on a tracebloc.io/ingestor-refresh-consecutive-failures annotation:

  • Below imageRefresh.ingestorResolveFailureThreshold (default 3, ≈45 min at the 15-min schedule) → WARN + skip, as today (tolerate transient blips).
  • At/above it → ERROR with actionable guidance (check ghcr.io egress / token endpoint), a tracebloc.io/ingestor-refresh-last-error annotation, and a non-zero exit so the Job fails visibly in kubectl get cronjob / monitoring — the same "failed Job = operator-visible" idiom Pass 2's stuck-rollout check already uses.
  • A successful resolve clears the streak.

Threshold is nil-guarded (| default 3) for --reuse-values upgrades and schema-validated (integer ≥ 1). The digest-resolution logic is unchanged.
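
The escalation rule above can be sketched as a small POSIX sh function. This is an illustration only: `decide_ingestor_action` and its output format are hypothetical names, not the chart's actual script, and comparing the post-increment streak against the threshold is an assumption chosen so that the default of 3 escalates on the third consecutive failed tick.

```shell
#!/bin/sh
# Hypothetical sketch of the consecutive-failure escalation described above.
# decide_ingestor_action and its output format are illustrative assumptions,
# not the chart's real implementation.

# decide_ingestor_action DIGEST STREAK THRESHOLD
# Prints "<action> <new_streak>", where action is one of:
#   refresh - digest resolved; the streak is cleared
#   warn    - resolve failed, still below the threshold; skip this tick
#   error   - resolve failed, at/above the threshold; caller should exit non-zero
decide_ingestor_action() {
  digest=$1
  streak=${2:-0}       # a missing or empty annotation counts as 0
  threshold=${3:-3}    # same default as the chart value

  if [ -n "$digest" ]; then
    echo "refresh 0"
    return 0
  fi

  # Assumption: the streak is incremented first, then compared.
  streak=$((streak + 1))
  if [ "$streak" -lt "$threshold" ]; then
    echo "warn $streak"
  else
    echo "error $streak"
  fi
}
```

With the default threshold, `decide_ingestor_action "" 2 3` prints `error 3`: the third failed tick in a row, which at a 15-minute schedule corresponds to the ≈45 min figure quoted above. A caller would presumably read the current streak from the tracebloc.io/ingestor-refresh-consecutive-failures annotation, write the new value back, and exit non-zero on `error`.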

Changed

  • templates/image-refresh-cronjob.yaml — failure-counting + escalation; new INGESTOR_RESOLVE_FAIL_THRESHOLD env (nil-guarded) + in-script :=3 default.
  • values.yaml / values.schema.json: imageRefresh.ingestorResolveFailureThreshold (default 3, min 1).
  • tests/image_refresh_test.yaml — default / --reuse-values fallback / override / schema-min tests.
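
As a rough illustration of the in-script `:=3` default, the snippet below shows how a script could fall back to 3 when INGESTOR_RESOLVE_FAIL_THRESHOLD is unset or empty. The helper name `resolve_threshold` is hypothetical, and the extra fallback for non-integer or below-minimum values is an assumption added for the example; the PR itself only states that the schema enforces an integer ≥ 1.

```shell
#!/bin/sh
# Hypothetical sketch: resolve_threshold is an illustrative helper,
# not a function from the chart's script.

# Prints the effective threshold: the env value when it is an integer >= 1,
# otherwise the fallback of 3.
resolve_threshold() {
  # In-script default: assigns 3 when the variable is unset or empty
  # (for example, a hand-edited CronJob that dropped the env entry).
  : "${INGESTOR_RESOLVE_FAIL_THRESHOLD:=3}"

  case "$INGESTOR_RESOLVE_FAIL_THRESHOLD" in
    *[!0-9]*)
      # Assumption: anything that is not a plain non-negative integer
      # falls back to the default rather than aborting.
      echo 3
      ;;
    *)
      if [ "$INGESTOR_RESOLVE_FAIL_THRESHOLD" -ge 1 ]; then
        echo "$INGESTOR_RESOLVE_FAIL_THRESHOLD"
      else
        echo 3   # 0 is below the schema minimum of 1
      fi
      ;;
  esac
}
```

The `: "${VAR:=default}"` form is standard POSIX parameter expansion, so it behaves the same under `sh -n` and shellcheck's `-s sh` mode as under bash.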

Verified

  • helm unittest 146/146 (4 new); helm lint clean; shellcheck -s sh + sh -n clean on the rendered script.

Notes

🤖 Generated with Claude Code


Note

Medium Risk
Changes failure semantics for the ingestor-only ghcr resolve path (CronJob can fail and patch deployment annotations), but successful refresh and digest logic are unchanged; mis-tuned thresholds could add noise or delay escalation.

Overview
When the ghcr.io ingestor digest lookup fails, image-refresh no longer warns and skips every tick forever. It tracks consecutive failures on the jobs-manager deployment (tracebloc.io/ingestor-refresh-consecutive-failures), still tolerates transient blips below a threshold, then fails the CronJob with actionable logs and tracebloc.io/ingestor-refresh-last-error so operators see stuck ingestor auto-refresh (e.g. docker.io works but ghcr.io is blocked). A successful resolve clears those annotations.

imageRefresh.ingestorResolveFailureThreshold (default 3, schema min 1) is wired through INGESTOR_RESOLVE_FAIL_THRESHOLD on the refresh pod, with template and in-script defaults for --reuse-values / hand-edited CronJobs. docker.io class-1 images are unchanged (still warn-and-skip). Helm unittest covers default, null fallback, override, and schema rejection.

Reviewed by Cursor Bugbot for commit 52beeb0.

…can't resolve

image-refresh silently skipped every tick when get_latest_digest returned
empty for the ghcr.io ingestor image (egress/proxy/firewall to ghcr.io, or a
blocked token endpoint) — never reaching the registry-drift branch that sets
the new digest. jobs-manager + pods-monitor pull from docker.io and refreshed
fine, so the CronJob looked healthy while the ingestor digest stayed pinned on
the install-time baseline. That's why the berlin-team arm64 install sat on the
amd64-only v0.3.1 digest even after :0.3 went multi-arch (#186 follow-up #2).

Now count consecutive ingestor-resolve failures on a deployment annotation:
- below imageRefresh.ingestorResolveFailureThreshold (default 3, ~45 min at the
  15-min schedule) -> WARN + skip, as before (tolerate transient blips);
- at/above it -> ERROR with actionable guidance, a
  tracebloc.io/ingestor-refresh-last-error annotation, and a non-zero exit so
  the Job fails visibly in `kubectl get cronjob` / monitoring — the same
  surfacing idiom Pass 2's stuck-rollout check already relies on;
- a successful resolve clears the streak.

Threshold is nil-guarded (default 3) for --reuse-values upgrades and
schema-validated (integer >= 1). The digest-resolution logic itself is
unchanged (verified correct: it returns the multi-arch index digest).

helm unittest 146/146, helm lint clean, shellcheck + sh -n clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka

👋 Heads-up — Code review queue is at 10 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@aptracebloc aptracebloc left a comment

Looks good!
