feat(#158): auto-refresh ingestor image digest without chart release#159
Merged
Conversation
Extends the existing image-refresh CronJob (#155) to also reconcile the ghcr.io/tracebloc/ingestor digest onto the live jobs-manager deployment's INGESTOR_IMAGE_DIGEST env var. New ingestor image publishes to the floating tag are now picked up within the cronjob's poll interval (~15 min) instead of requiring a full chart release. Why Today, shipping a new ghcr.io/tracebloc/ingestor image required bumping client/values.yaml images.ingestor.digest + client/Chart.yaml + PR + sync to main + release tag. That's hours of overhead per bump and asymmetric with jobs-manager (which already gets the ~15-min image-refresh path). The asymmetry hurts because the ingestor changes frequently as the data-ingestors team iterates. Design Two image classes in one CronJob now: Class 1 (jobs-manager, pods-monitor): Registry: docker.io Tag: CLIENT_ENV Source of truth: deployment annotation `tracebloc.io/last-refreshed-<image>-digest` (#154) Action on change: kubectl rollout restart Class 2 (ingestor): Registry: ghcr.io Tag: images.ingestor.tag (default "prod") Source of truth: live INGESTOR_IMAGE_DIGEST env value on the api container of the jobs-manager deployment (no annotation needed — the env IS the digest jobs-manager passes to each spawned ingestion Job, so the most direct read of "what will be used next" is THIS value). Action on change: kubectl set env (triggers natural rollout via ReplicaSet rotation — no explicit `rollout restart`). get_token + get_latest_digest parameterized by registry; both docker.io and ghcr.io support anonymous pull tokens for public images with only the issuer URL differing. Per-image opt-out * jobs-manager / pods-monitor: same as #154 — set `images.<image>.digest` non-empty. * ingestor: explicit `images.ingestor.autoRefresh: false` flag. Asymmetric because ingestor.digest must be non-empty for jobs-manager to function (an empty env would 503 every ingestion submit), so we can't use digest-presence as the signal. When ALL THREE pin signals are active, the chart renders no image-refresh resources at all (helper `imageRefreshEnabled`). When at least one is unpinned, the cronjob is rendered and the script skips pinned images via env flags at runtime. Chart-default ingestor digest stays pinned (v0.3.0) as the baseline for greenfield installs; image-refresh dynamically updates the live env from there. Helm's 3-way merge preserves image-refresh's writes across future helm upgrades as long as the chart's pinned baseline doesn't change. Subtle gotcha caught in dev `default true $autoRefresh` in Go templates returns `true` even when $autoRefresh is explicitly `false` (Go treats bool false as falsy, so default overrides it). Switched to `eq $autoRefresh false` directly — absence (nil) and explicit `true` both fall through to "not pinned" as intended. Test pinned against the correct idiom. Other changes * `log()` continues to write to stderr (#155 fix). * `get_container_env` helper for jq-based env-var reads — same kubectl-jsonpath caveat as `get_annotation` (#156). * Chart version bumped 1.4.0 → 1.4.1. Tests 20 image-refresh-suite tests (was 17), 140 total pass. New assertions: * all-three-pinned renders zero resources * only-jobs-manager+pods-monitor-pinned still renders (regression guard for the asymmetric pin signal — without this, the ingestor would never auto-refresh on default installs) * INGESTOR_PINNED flips correctly on autoRefresh=false * INGESTOR_TAG is overridable, `latest` rejected by schema * Script must include `kubectl set env`, `ghcr.io`, `auth.docker.io`, `get_container_env`, the empty-env fill-from-registry path, and the autoRefresh-skip log line Closes #158 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
|
👋 Heads-up — Code review queue is at 16 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
…e retry) Caught in PR #159 review (bugbot, medium severity). The original design used the live spec env (`INGESTOR_IMAGE_DIGEST` on the api container) as the source of truth for "what image-refresh has reconciled to." `kubectl set env` commits the new spec to etcd BEFORE `kubectl rollout status` waits for the rollout to complete. If the rollout times out or the new ReplicaSet's pods fail to come up: * `set -eu` aborts the script. * But the spec already matches the registry. * Next tick: `get_container_env` returns the new digest, compares equal to registry, no-op → script appears successful. * Meanwhile the old ReplicaSet's pods are still running with the OLD env, and the new ReplicaSet is stuck failing. The deployment is frozen on the old version with no retry signal. Fix: mirror the jobs-manager/pods-monitor pattern from #154. Use a `tracebloc.io/last-refreshed-ingestor-digest` annotation on the deployment as the source of truth. Update the annotation as the LAST step, only after `rollout status` succeeds. A failed rollout aborts before the annotate → next tick sees stale annotation → retries. First-observation contract for ingestor: * Non-empty spec env (the normal case — chart populates a default): adopt as baseline annotation, don't touch env. Same "don't churn on install" principle as jobs-manager first-observation. * Empty spec env (corrupted state, manual kubectl edit, stale --reuse-values): fill from registry on first tick. Empty would otherwise cause jobs-manager to 503 on every ingestion submit, so the "don't churn" trade is wrong in that case. Tests pin: * Annotation key `tracebloc.io/last-refreshed-ingestor-digest` appears in the script. * Order-of-operations: set env → rollout status → annotate (the annotate MUST come last; regex matches the full sequence). 140/140 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Found during PR #159 self-review. The top-of-script comment block still described the pre-bugbot-fix design — claimed "Source of truth: the live env value itself (no annotation needed)" and "no 'first observation' empty-state, each tick is a normal compare-and-patch-if-different." Both were stale after e7cf829 switched ingestor to annotation-based source of truth and added the first-observation branch. Anyone reading the script-level overview would have been misled about the actual loop. Comment now matches the code: annotation as source of truth, two- case first-observation contract (non-empty → adopt as baseline; empty → fill from registry). Also adds a positive regression test for the previously-untested first-observation "adopting spec env as baseline" branch. The empty-spec-env branch was already covered indirectly by the existing "would 503 on ingestion submit" regex. 140/140 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tOps) Caught in PR #159 review (bugbot, medium severity). Follow-up to the annotation-source-of-truth switch in e7cf829. The previous no-op branch fired whenever annotation == registry and NEVER read the live spec env. That meant any external actor that reverted the spec env without touching the annotation would leave the deployment on a stale env indefinitely. The annotation continued to match the registry, so image-refresh kept skipping. Real scenarios this affects: * `kubectl rollout undo deployment/X` — reverts pod template to a previous ReplicaSet's spec, including its INGESTOR_IMAGE_DIGEST env. Annotation on deployment metadata is untouched. * `kubectl edit deployment X` — operator manually changes the env. * Certain `helm upgrade` flag combos can reset env to the chart's pre-image-refresh baseline while preserving annotations (e.g., --reset-values or upgrade from a chart where the digest baseline differs from what image-refresh had reconciled to). * GitOps reconcilers (Argo CD, Flux) that own the deployment spec will revert image-refresh's env writes back to the rendered template values. In all of these, the live deployment runs a stale ingestor image forever — exactly the failure mode #158 was meant to prevent. Fix: each tick now reads both the annotation AND the live spec env. Three reconciliation paths: * recorded != registry → "registry drift". Set env to registry, wait for rollout, update annotation. (Existing behaviour.) * recorded == registry AND spec env != recorded → "env drift". Set env to recorded value (NOT registry — registry matches recorded by definition here, but recorded is the value we last decided to roll to). Wait for rollout. Don't update the annotation; it's already correct. * recorded == registry AND spec env == recorded → fully in sync, no-op. Tests pin: * The "spec env drifted" log line. * The drift-recovery branch sets env to `${recorded_ingestor}`, not `${latest_ingestor}` (different from the registry-drift branch). Regex catches the variable used in `INGESTOR_IMAGE_DIGEST=`. Top-of-script comment block updated to document the drift recovery. 140/140 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 25, 2026
This was referenced May 25, 2026
…resh-ingestor # Conflicts: # client/values.yaml
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit dda3891. Configure here.
Caught in PR #159 review (bugbot, high severity). Scenario the existing code mishandles: 1. Tick N: empty-spec-first-obs branch runs `kubectl set env` (commits new spec to etcd) → `kubectl rollout status` times out → `set -eu` aborts before the annotate. Annotation stays empty. 2. Tick N+1: annotation still empty. spec env is now non-empty (the failed-rollout's spec change persists). get_container_env returns that value, so the adopt-as-baseline branch fires. 3. Adopt-as-baseline only annotates — it never checks rollout health. Annotation records "we're at D1" while running pods are still on the old/empty env from before tick N. The deployment now appears reconciled (annotation == registry on subsequent ticks) while actually being stuck on the wrong image. Fix: call `kubectl rollout status` inside the adopt-as-baseline branch before the annotate. On a healthy deployment it returns near-instantly; on a stuck rollout from a previous failed set-env it times out, `set -eu` aborts before the annotate, next tick retries. No latency cost on the happy path. Regression test pins the (?s)-multiline order: adopting → rollout status → annotate. 140/140 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
waqaskhanroghani
approved these changes
May 25, 2026
saadqbal
added a commit
that referenced
this pull request
May 26, 2026
* Merge pull request #161 from tracebloc/chore/pin-ingestor-v0.3.1-160 chore: pin client chart's ingestor digest to v0.3.1 * feat(#158): auto-refresh ingestor image digest without chart release (#159) * feat(#158): auto-refresh ingestor image digest without chart release Extends the existing image-refresh CronJob (#155) to also reconcile the ghcr.io/tracebloc/ingestor digest onto the live jobs-manager deployment's INGESTOR_IMAGE_DIGEST env var. New ingestor image publishes to the floating tag are now picked up within the cronjob's poll interval (~15 min) instead of requiring a full chart release. Why Today, shipping a new ghcr.io/tracebloc/ingestor image required bumping client/values.yaml images.ingestor.digest + client/Chart.yaml + PR + sync to main + release tag. That's hours of overhead per bump and asymmetric with jobs-manager (which already gets the ~15-min image-refresh path). The asymmetry hurts because the ingestor changes frequently as the data-ingestors team iterates. Design Two image classes in one CronJob now: Class 1 (jobs-manager, pods-monitor): Registry: docker.io Tag: CLIENT_ENV Source of truth: deployment annotation `tracebloc.io/last-refreshed-<image>-digest` (#154) Action on change: kubectl rollout restart Class 2 (ingestor): Registry: ghcr.io Tag: images.ingestor.tag (default "prod") Source of truth: live INGESTOR_IMAGE_DIGEST env value on the api container of the jobs-manager deployment (no annotation needed — the env IS the digest jobs-manager passes to each spawned ingestion Job, so the most direct read of "what will be used next" is THIS value). Action on change: kubectl set env (triggers natural rollout via ReplicaSet rotation — no explicit `rollout restart`). get_token + get_latest_digest parameterized by registry; both docker.io and ghcr.io support anonymous pull tokens for public images with only the issuer URL differing. Per-image opt-out * jobs-manager / pods-monitor: same as #154 — set `images.<image>.digest` non-empty. * ingestor: explicit `images.ingestor.autoRefresh: false` flag. Asymmetric because ingestor.digest must be non-empty for jobs-manager to function (an empty env would 503 every ingestion submit), so we can't use digest-presence as the signal. When ALL THREE pin signals are active, the chart renders no image-refresh resources at all (helper `imageRefreshEnabled`). When at least one is unpinned, the cronjob is rendered and the script skips pinned images via env flags at runtime. Chart-default ingestor digest stays pinned (v0.3.0) as the baseline for greenfield installs; image-refresh dynamically updates the live env from there. Helm's 3-way merge preserves image-refresh's writes across future helm upgrades as long as the chart's pinned baseline doesn't change. Subtle gotcha caught in dev `default true $autoRefresh` in Go templates returns `true` even when $autoRefresh is explicitly `false` (Go treats bool false as falsy, so default overrides it). Switched to `eq $autoRefresh false` directly — absence (nil) and explicit `true` both fall through to "not pinned" as intended. Test pinned against the correct idiom. Other changes * `log()` continues to write to stderr (#155 fix). * `get_container_env` helper for jq-based env-var reads — same kubectl-jsonpath caveat as `get_annotation` (#156). * Chart version bumped 1.4.0 → 1.4.1. Tests 20 image-refresh-suite tests (was 17), 140 total pass. New assertions: * all-three-pinned renders zero resources * only-jobs-manager+pods-monitor-pinned still renders (regression guard for the asymmetric pin signal — without this, the ingestor would never auto-refresh on default installs) * INGESTOR_PINNED flips correctly on autoRefresh=false * INGESTOR_TAG is overridable, `latest` rejected by schema * Script must include `kubectl set env`, `ghcr.io`, `auth.docker.io`, `get_container_env`, the empty-env fill-from-registry path, and the autoRefresh-skip log line Closes #158 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#158): annotation as source of truth for ingestor (rollout-failure retry) Caught in PR #159 review (bugbot, medium severity). The original design used the live spec env (`INGESTOR_IMAGE_DIGEST` on the api container) as the source of truth for "what image-refresh has reconciled to." `kubectl set env` commits the new spec to etcd BEFORE `kubectl rollout status` waits for the rollout to complete. If the rollout times out or the new ReplicaSet's pods fail to come up: * `set -eu` aborts the script. * But the spec already matches the registry. * Next tick: `get_container_env` returns the new digest, compares equal to registry, no-op → script appears successful. * Meanwhile the old ReplicaSet's pods are still running with the OLD env, and the new ReplicaSet is stuck failing. The deployment is frozen on the old version with no retry signal. Fix: mirror the jobs-manager/pods-monitor pattern from #154. Use a `tracebloc.io/last-refreshed-ingestor-digest` annotation on the deployment as the source of truth. Update the annotation as the LAST step, only after `rollout status` succeeds. A failed rollout aborts before the annotate → next tick sees stale annotation → retries. First-observation contract for ingestor: * Non-empty spec env (the normal case — chart populates a default): adopt as baseline annotation, don't touch env. Same "don't churn on install" principle as jobs-manager first-observation. * Empty spec env (corrupted state, manual kubectl edit, stale --reuse-values): fill from registry on first tick. Empty would otherwise cause jobs-manager to 503 on every ingestion submit, so the "don't churn" trade is wrong in that case. Tests pin: * Annotation key `tracebloc.io/last-refreshed-ingestor-digest` appears in the script. * Order-of-operations: set env → rollout status → annotate (the annotate MUST come last; regex matches the full sequence). 140/140 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(#158): correct stale top-of-script comment + add first-obs test Found during PR #159 self-review. The top-of-script comment block still described the pre-bugbot-fix design — claimed "Source of truth: the live env value itself (no annotation needed)" and "no 'first observation' empty-state, each tick is a normal compare-and-patch-if-different." Both were stale after e7cf829 switched ingestor to annotation-based source of truth and added the first-observation branch. Anyone reading the script-level overview would have been misled about the actual loop. Comment now matches the code: annotation as source of truth, two- case first-observation contract (non-empty → adopt as baseline; empty → fill from registry). Also adds a positive regression test for the previously-untested first-observation "adopting spec env as baseline" branch. The empty-spec-env branch was already covered indirectly by the existing "would 503 on ingestion submit" regex. 140/140 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#158): also reconcile env drift (rollout undo / kubectl edit / GitOps) Caught in PR #159 review (bugbot, medium severity). Follow-up to the annotation-source-of-truth switch in e7cf829. The previous no-op branch fired whenever annotation == registry and NEVER read the live spec env. That meant any external actor that reverted the spec env without touching the annotation would leave the deployment on a stale env indefinitely. The annotation continued to match the registry, so image-refresh kept skipping. Real scenarios this affects: * `kubectl rollout undo deployment/X` — reverts pod template to a previous ReplicaSet's spec, including its INGESTOR_IMAGE_DIGEST env. Annotation on deployment metadata is untouched. * `kubectl edit deployment X` — operator manually changes the env. * Certain `helm upgrade` flag combos can reset env to the chart's pre-image-refresh baseline while preserving annotations (e.g., --reset-values or upgrade from a chart where the digest baseline differs from what image-refresh had reconciled to). * GitOps reconcilers (Argo CD, Flux) that own the deployment spec will revert image-refresh's env writes back to the rendered template values. In all of these, the live deployment runs a stale ingestor image forever — exactly the failure mode #158 was meant to prevent. Fix: each tick now reads both the annotation AND the live spec env. Three reconciliation paths: * recorded != registry → "registry drift". Set env to registry, wait for rollout, update annotation. (Existing behaviour.) * recorded == registry AND spec env != recorded → "env drift". Set env to recorded value (NOT registry — registry matches recorded by definition here, but recorded is the value we last decided to roll to). Wait for rollout. Don't update the annotation; it's already correct. * recorded == registry AND spec env == recorded → fully in sync, no-op. Tests pin: * The "spec env drifted" log line. * The drift-recovery branch sets env to `${recorded_ingestor}`, not `${latest_ingestor}` (different from the registry-drift branch). Regex catches the variable used in `INGESTOR_IMAGE_DIGEST=`. Top-of-script comment block updated to document the drift recovery. 140/140 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#158): wait for rollout status in adopt-as-baseline branch Caught in PR #159 review (bugbot, high severity). Scenario the existing code mishandles: 1. Tick N: empty-spec-first-obs branch runs `kubectl set env` (commits new spec to etcd) → `kubectl rollout status` times out → `set -eu` aborts before the annotate. Annotation stays empty. 2. Tick N+1: annotation still empty. spec env is now non-empty (the failed-rollout's spec change persists). get_container_env returns that value, so the adopt-as-baseline branch fires. 3. Adopt-as-baseline only annotates — it never checks rollout health. Annotation records "we're at D1" while running pods are still on the old/empty env from before tick N. The deployment now appears reconciled (annotation == registry on subsequent ticks) while actually being stuck on the wrong image. Fix: call `kubectl rollout status` inside the adopt-as-baseline branch before the annotate. On a healthy deployment it returns near-instantly; on a stuck rollout from a previous failed set-env it times out, `set -eu` aborts before the annotate, next tick retries. No latency cost on the happy path. Regression test pins the (?s)-multiline order: adopting → rollout status → annotate. 140/140 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * fix(#158): two bugbot follow-ups + chart default tag Caught in PR #162 review (bugbot, two medium-severity issues). 1. Env-drift rollout retry gap The no-op branch (annotation == registry AND spec env == recorded) was a bare log statement with no rollout-health verification. A previous tick's env-drift `kubectl set env` commits its spec change to etcd BEFORE `kubectl rollout status` waits for the new ReplicaSet to come up. If the rollout fails, `set -eu` aborts — but the spec write persists. Next tick: annotation, registry, and spec env all match (because the spec write committed), so the no-op branch fires and silently masks the stuck rollout. Running pods may be on the old or empty INGESTOR_IMAGE_DIGEST while the script reports success. Fix: call `kubectl rollout status` in the no-op branch too. On a healthy deployment it returns near-instantly (no active rollout to wait for). On a stuck deployment it times out, set -eu aborts, and the Job is visibly failed in `kubectl get cronjob`. The operator then sees the stuck state and can investigate. Image- refresh can't autonomously recover from a bad image push, but making the failure visible is the right behaviour. 2. Default ingestor tag mismatched team's publishing convention Chart defaulted `images.ingestor.tag: prod`. The team's ghcr.io/tracebloc/ingestor repo uses semver-style float tags (`0`, `0.3`) — there is no `prod` tag. Default install would silently no-op every tick because manifest resolution 404'd: curl ... ghcr.io/v2/.../manifests/prod → 404 log " WARN: could not resolve latest digest; skipping" The whole ingestor auto-refresh feature wouldn't work for any customer running the chart's defaults, despite `autoRefresh: true`. Fix: changed default to "0.3" (conservative — patch-only auto- track; won't pick up a future 0.4 with breaking changes). Operators can override to "0" if they want major-version auto-tracking. Long-term, the team should consider standardising the chart default once the data-ingestors release-image.yml formalises its tag-publishing contract — for now this matches what we tested with on the dev cluster. Regression tests: * Default tag asserted as "0.3" with `notContains` of "prod" to guard against silent revert. * No-op branch asserted to call `kubectl rollout status` via (?s)-multiline regex matching the "verifying deployment health" log line + the kubectl rollout status call. * Existing test updated from value: prod to value: "0.3". 141/141 unit tests pass. NB: these commits are landing on the sync branch directly to avoid another full develop-PR cycle before release. After #162 merges, the same content will need to flow back to develop — either via a "sync main → develop" PR or by cherry-picking the two commits. The divergence is two commits and is easy to resolve. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#158): align INGESTOR_TAG runtime fallback with chart default Caught in PR #162 review (bugbot, medium severity). The chart default was changed from "prod" to "0.3" in the values block — matches the team's ghcr.io publishing convention — but the CronJob template's runtime fallback was left at `| default "prod"`. Two render paths: * helm install / helm upgrade --reset-then-reuse-values: the chart's new default ("0.3") flows through, runtime fallback never fires, INGESTOR_TAG="0.3". OK. * helm upgrade --reuse-values from a pre-v1.4.1 stored manifest: the stored values lack `images.ingestor.tag` entirely. Runtime fallback fires, renders INGESTOR_TAG="prod", which 404s on ghcr.io because that tag doesn't exist. Ingestor refresh silently no-ops every tick. Failure mode is graceful (log warning, no crash), but inconsistent with the per-customer expectation that v1.4.1 enables ingestor auto-refresh. autoUpgrade itself uses --reset-then-reuse-values, so this only hits manual --reuse-values upgrades — narrow but real. Fix: change runtime fallback to "0.3" so both render paths converge. Regression test simulates the --reuse-values scenario by setting images.ingestor.tag=null, exercising the runtime fallback. 142/142 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
7 tasks
This was referenced Jun 3, 2026
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Extends the image-refresh CronJob (shipped in #155 / v1.4.0) to also reconcile the
ghcr.io/tracebloc/ingestordigest onto the live jobs-manager deployment'sINGESTOR_IMAGE_DIGESTenv var. New ingestor image publishes to the floating tag are now picked up within ~15 min — no chart release required for ingestor image bumps.Closes #158.
Design
Two image classes in one CronJob:
CLIENT_ENVtracebloc.io/last-refreshed-<image>-digestannotationkubectl rollout restart(#155)images.ingestor.tag(defaultprod)INGESTOR_IMAGE_DIGESTenv value on the api containerkubectl set env(triggers natural rollout)Class 2 needs no annotation because the env IS the digest jobs-manager passes to each spawned ingestion Job — the most direct read of "what will be used next" is THIS value.
get_token+get_latest_digestparameterized by registry. Both docker.io and ghcr.io support anonymous pull tokens for public images.Per-image opt-out
images.<image>.digestnon-empty.images.ingestor.autoRefresh: falseflag. Asymmetric becauseingestor.digestmust be non-empty for jobs-manager to function (empty would cause 503s on every ingestion submit), so we can't use digest-presence as the signal.When all three pin signals are active, the chart renders no image-refresh resources.
Subtle gotcha caught in dev
default true $autoRefreshreturnstrueeven when$autoRefreshis explicitlyfalse(Go treats boolfalseas falsy, sodefaultoverrides). Switched toeq $autoRefresh falsedirectly; regression test pinned.Test plan
helm lintcleanhelm templaterenders correctly across pin combinations (default + all-three-pinned + only-jobs-manager-pinned + autoRefresh=false)helm unittest: 20 image-refresh tests (was 17), 140 total ✅traceblocrelease to 1.4.1, trigger image-refresh, verify ingestor env updates without rollout-restart actionChart version
Bumped 1.4.0 → 1.4.1.
🤖 Generated with Claude Code
Note
Medium Risk
Automated patches to the jobs-manager deployment and ingestion image env affect every ingestion submit; logic is intricate (rollout ordering, env vs annotation drift) though scope stays namespace-scoped RBAC on deployments.
Overview
Extends the image-refresh CronJob so ghcr.io/tracebloc/ingestor digest changes on a floating tag are reconciled onto the jobs-manager deployment’s
INGESTOR_IMAGE_DIGESTenv (viakubectl set envon theapicontainer), so new ingestor publishes can land without a chart release.Enablement / opt-out:
tracebloc.imageRefreshEnablednow requires all three managed images to be “pinned” before the CronJob is omitted—jobs-manager and pods-monitor still use non-emptydigest, ingestor usesimages.ingestor.autoRefresh: false(because ingestordigestmust stay set). Helm useseq … falseinstead ofdefault trueso explicitautoRefresh: falseis honored.Script: Registry helpers support docker.io and ghcr.io. Pass 1 unchanged in spirit (annotations + rollout restart). Pass 2 adds annotation
tracebloc.io/last-refreshed-ingestor-digestas the success record (annotate only afterrollout status), first-tick adopt-or-fill behavior, and spec-env drift recovery when the registry matches but env was reverted externally.Values / schema:
images.ingestor.tag(defaultprod, rejectslatest) andimages.ingestor.autoRefresh; values comments describe chart baseline digest vs ongoing refresh. Helm unittest coverage expanded for ingestor paths and pin combinations.Reviewed by Cursor Bugbot for commit 5be26c8. Bugbot is set up for automated code reviews on this repo. Configure here.