fix(#190): fail image-refresh loudly when the ingestor (ghcr) digest can't resolve by saadqbal · Pull Request #191 · tracebloc/client

saadqbal · 2026-06-03T08:53:05Z

Closes #190. Image-refresh hardening — follow-up #2 of #186 (the digest pin #185 and CI guard #187 covered the chart side).

Why

On the berlin-team arm64 install, jobs-manager stayed on the amd64-only ingestor baseline (sha256:a0861ea9…) even though the image-refresh CronJob ran and :0.3 had become a multi-arch index (sha256:d361fa77…). Root cause: get_latest_digest(… ghcr.io) returned empty in-cluster (egress/proxy/firewall to ghcr.io, or a blocked token endpoint — on-prem installs often allowlist docker.io but not ghcr.io), so the ingestor block hit:

if [ -z "$latest_ingestor" ]; then
  log "  WARN: could not resolve latest digest (rate-limited or transient); skipping this tick"

and silently skipped every tick, never reaching the registry-drift branch. jobs-manager + pods-monitor pull from docker.io and refreshed fine, so the CronJob looked healthy and nothing surfaced the ghcr failure. A persistent failure was indistinguishable from a transient one.
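A persistent block can usually be told apart from a transient blip by probing the registry from a pod in the same namespace. The sketch below is an assumption and not part of the chart. It relies on the OCI registry convention that an anonymous GET on /v2/ answers 200 or 401 when the registry is reachable, and on curl reporting 000 as the HTTP code when no response arrived. Both function names are hypothetical.

```sh
#!/bin/sh
# Hypothetical helpers for checking registry egress from inside the cluster.
# Not taken from the chart; names are illustrative only.

# Print the HTTP status for URL, or 000 when no response was received.
probe_http_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" 2>/dev/null || true
}

# Map a status code to a coarse verdict.
classify_probe() {
  case $1 in
    ''|000)  echo unreachable ;;   # DNS, firewall, proxy or timeout
    200|401) echo reachable ;;     # registry answered; 401 only means "authenticate"
    *)       echo unexpected ;;    # e.g. a proxy block page returning 403
  esac
}

# Usage (performs real network calls):
#   classify_probe "$(probe_http_code https://ghcr.io/v2/)"
#   classify_probe "$(probe_http_code https://registry-1.docker.io/v2/)"
```

If docker.io comes back reachable while ghcr.io does not, that matches the allowlist situation described above.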

Ruled out while investigating: auto-upgrade does not reset the env (skips when already-latest; --reset-then-reuse-values, no --force); the resolver parses multi-arch indexes correctly (verified live → d361fa77); chart 1.4.3 already includes ingestor refresh.

What

Count consecutive ingestor-resolve failures on a tracebloc.io/ingestor-refresh-consecutive-failures annotation:

Below imageRefresh.ingestorResolveFailureThreshold (default 3, ≈45 min at the 15-min schedule) → WARN + skip, as today (tolerate transient blips).
At/above it → ERROR with actionable guidance (check ghcr.io egress / token endpoint), a tracebloc.io/ingestor-refresh-last-error annotation, and a non-zero exit so the Job fails visibly in kubectl get cronjob / monitoring — the same "failed Job = operator-visible" idiom Pass 2's stuck-rollout check already uses.
A successful resolve clears the streak.
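The three rules above can be sketched as follows. This is a minimal illustration, not the rendered script: the streak is kept in a temp file (a stand-in for the deployment annotation) so the logic runs without a cluster, and the helper names, return codes and ERROR wording are assumptions.

```sh
#!/bin/sh
# Sketch of the consecutive-failure logic. STATE_FILE stands in for the
# tracebloc.io/ingestor-refresh-consecutive-failures annotation.

STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/ingestor-refresh-failures.$$}"
: "${INGESTOR_RESOLVE_FAIL_THRESHOLD:=3}"

log() { printf '%s\n' "$*" >&2; }

read_streak()  { if [ -f "$STATE_FILE" ]; then cat "$STATE_FILE"; else echo 0; fi; }
write_streak() { printf '%s\n' "$1" > "$STATE_FILE"; }
clear_streak() { rm -f "$STATE_FILE"; }

# handle_resolve_result DIGEST
# Returns 0 when a digest was resolved (streak cleared),
# 10 when this tick should be skipped (below threshold),
# 1 when the Job should fail (at or above threshold).
handle_resolve_result() {
  latest_ingestor=$1
  if [ -n "$latest_ingestor" ]; then
    clear_streak
    return 0
  fi
  streak=$(( $(read_streak) + 1 ))
  write_streak "$streak"
  if [ "$streak" -lt "$INGESTOR_RESOLVE_FAIL_THRESHOLD" ]; then
    log "  WARN: could not resolve latest digest ($streak/$INGESTOR_RESOLVE_FAIL_THRESHOLD); skipping this tick"
    return 10
  fi
  # Illustrative wording; the real message is defined in the template.
  log "  ERROR: ingestor digest unresolved for $streak consecutive ticks; check ghcr.io egress / token endpoint"
  return 1
}
```

With the default threshold, two empty results in a row return 10 and the third returns 1. A non-empty digest at any point resets the count to 0.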

Threshold is nil-guarded (| default 3) for --reuse-values upgrades and schema-validated (integer ≥ 1). The digest-resolution logic is unchanged.
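On the script side, the fallback can be illustrated with POSIX parameter expansion. The `:=3` default is what this PR describes. The extra integer check is an assumption added here for illustration (the PR enforces the minimum through values.schema.json), and `resolve_threshold` is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: derive a usable threshold from a possibly unset or empty value.

resolve_threshold() {
  threshold=${1:-}
  # Same idiom as the in-script default: assign 3 when unset or empty.
  : "${threshold:=3}"
  # Assumed extra guard: fall back to 3 for non-numeric input or values below 1.
  case $threshold in
    *[!0-9]*) threshold=3 ;;
  esac
  [ "$threshold" -ge 1 ] || threshold=3
  printf '%s\n' "$threshold"
}

# Usage:
#   INGESTOR_RESOLVE_FAIL_THRESHOLD=$(resolve_threshold "${INGESTOR_RESOLVE_FAIL_THRESHOLD:-}")
```

The double default (template plus script) matters for hand-edited CronJobs, where the env var may be missing even though the chart value exists.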

Changed

templates/image-refresh-cronjob.yaml — failure-counting + escalation; new INGESTOR_RESOLVE_FAIL_THRESHOLD env (nil-guarded) + in-script :=3 default.
values.yaml / values.schema.json — imageRefresh.ingestorResolveFailureThreshold (default 3, min 1).
tests/image_refresh_test.yaml — default / --reuse-values fallback / override / schema-min tests.

Verified

helm unittest 146/146 (4 new); helm lint clean; shellcheck -s sh + sh -n clean on the rendered script.

Notes

No Chart.yaml bump: develop accumulates at 1.4.4 (matching #187, "ci(helm): guard that the pinned ingestor digest is multi-arch (closes #186)"); release-sync handles versioning.
This complements but doesn't replace #185, "chore(#184): pin greenfield ingestor baseline to v0.3.2" (multi-arch baseline as the floor). Together: the chart pins a runnable digest, the CI guard (#187) keeps it multi-arch, and image-refresh now fails loudly instead of silently when it can't track the float.

🤖 Generated with Claude Code

Note

Medium Risk
Changes failure semantics for the ingestor-only ghcr resolve path (CronJob can fail and patch deployment annotations), but successful refresh and digest logic are unchanged; mis-tuned thresholds could add noise or delay escalation.

Overview
When the ghcr.io ingestor digest lookup fails, image-refresh no longer logs a WARN and skips every tick forever. It tracks consecutive failures on the jobs-manager deployment (tracebloc.io/ingestor-refresh-consecutive-failures), still tolerates transient blips below a threshold, then fails the CronJob with actionable logs and a tracebloc.io/ingestor-refresh-last-error annotation so operators can see that ingestor auto-refresh is stuck (e.g. docker.io works but ghcr.io is blocked). A successful resolve clears those annotations.
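As a rough illustration of how these two annotations could be written and cleared, the sketch below wraps `kubectl annotate`. It is not the template's code: the deployment name, the `KUBECTL` override and the function names are assumptions, and the namespace is left to the current context.

```sh
#!/bin/sh
# Sketch: set and clear the failure annotations with kubectl.

KUBECTL=${KUBECTL:-kubectl}
DEPLOYMENT=${DEPLOYMENT:-jobs-manager}   # hypothetical resource name
COUNT_KEY=tracebloc.io/ingestor-refresh-consecutive-failures
ERROR_KEY=tracebloc.io/ingestor-refresh-last-error

# Record the current streak; --overwrite replaces an existing value.
record_failure_count() {
  "$KUBECTL" annotate deployment "$DEPLOYMENT" --overwrite "$COUNT_KEY=$1"
}

# Record the escalation message once the threshold is reached.
record_last_error() {
  "$KUBECTL" annotate deployment "$DEPLOYMENT" --overwrite "$ERROR_KEY=$1"
}

# A trailing "-" on the key asks kubectl to remove that annotation.
clear_failure_state() {
  "$KUBECTL" annotate deployment "$DEPLOYMENT" "$COUNT_KEY-" "$ERROR_KEY-"
}
```

Setting `KUBECTL=echo` prints the commands instead of running them, which is a convenient dry run. Operators can read the current state with `kubectl get deployment <name> -o yaml` and look under `metadata.annotations`.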

imageRefresh.ingestorResolveFailureThreshold (default 3, schema min 1) is wired through INGESTOR_RESOLVE_FAIL_THRESHOLD on the refresh pod, with template and in-script defaults for --reuse-values / hand-edited CronJobs. docker.io class-1 images are unchanged (still warn-and-skip). Helm unittest covers default, null fallback, override, and schema rejection.

Reviewed by Cursor Bugbot for commit 52beeb0. Bugbot is set up for automated code reviews on this repo.

…can't resolve

image-refresh silently skipped every tick when get_latest_digest returned empty for the ghcr.io ingestor image (egress/proxy/firewall to ghcr.io, or a blocked token endpoint), never reaching the registry-drift branch that sets the new digest. jobs-manager + pods-monitor pull from docker.io and refreshed fine, so the CronJob looked healthy while the ingestor digest stayed pinned on the install-time baseline. That's why the berlin-team arm64 install sat on the amd64-only v0.3.1 digest even after :0.3 went multi-arch (#186 follow-up #2).

Now count consecutive ingestor-resolve failures on a deployment annotation:
- below imageRefresh.ingestorResolveFailureThreshold (default 3, ~45 min at the 15-min schedule) -> WARN + skip, as before (tolerate transient blips);
- at/above it -> ERROR with actionable guidance, a tracebloc.io/ingestor-refresh-last-error annotation, and a non-zero exit so the Job fails visibly in `kubectl get cronjob` / monitoring (the same surfacing idiom Pass 2's stuck-rollout check already relies on);
- a successful resolve clears the streak.

Threshold is nil-guarded (default 3) for --reuse-values upgrades and schema-validated (integer >= 1). The digest-resolution logic itself is unchanged (verified correct: it returns the multi-arch index digest).

helm unittest 146/146, helm lint clean, shellcheck + sh -n clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-03T08:53:50Z

👋 Heads-up — Code review queue is at 10 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
data-ingestors#146 — test(schema): lock category enum to the engine's TaskCategory (anti-drift) · author: @LukasWodka · reviewer: @saadqbal
design-system#19 — fix: un-track coverage/ and node_modules/ from git · author: @LukasWodka · no reviewer assigned
design-system#22 — ci: add Vitest test workflow · author: @LukasWodka · reviewer: @aptracebloc
design-system#23 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
docs#46 — docs: make declarative-ingest staging self-contained (data-ingestors#131 B/C) · author: @divyasinghds · no reviewer assigned
frontend-app#499 — ci: add Vitest + Cypress test workflow · author: @LukasWodka · reviewer: @aptracebloc
model-zoo#77 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
model-zoo#85 — Develop · author: @divyasinghds · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

aptracebloc

Looks good!

This was referenced Jun 3, 2026

docs: drop Homebrew from README until the tap ships tracebloc/cli#20

Merged

docs(cli): refresh stale root --help string tracebloc/cli#21

Merged

saadqbal self-assigned this Jun 3, 2026

aptracebloc approved these changes Jun 3, 2026

saadqbal wants to merge 1 commit into develop from fix/image-refresh-surface-ghcr-failures
