Skip to content

fix(#154): read annotations via jq (kubectl jsonpath bracket notation broken on dotted keys)#156

Merged
saadqbal merged 1 commit into
developfrom
fix/154-annotation-read-via-jq
May 25, 2026
Merged

fix(#154): read annotations via jq (kubectl jsonpath bracket notation broken on dotted keys)#156
saadqbal merged 1 commit into
developfrom
fix/154-annotation-read-via-jq

Conversation

@saadqbal
Copy link
Copy Markdown
Contributor

@saadqbal saadqbal commented May 25, 2026

Summary

Follow-up to #155 — bug caught in the dev-cluster smoke test on tb-client-dev-templates.

kubectl get -o jsonpath=\"{.metadata.annotations['key.with.dots']}\" returns empty when the key contains . or /, on kubectl-go 1.30.x (alpine/k8s 1.30.5). Verified directly: a separate kubectl get -o jsonpath='{.metadata.annotations}' showed the annotation present and correctly persisted, but the script's bracket-notation read returned empty.

Effect: every image-refresh tick saw recorded=<unset> and re-entered the first-observation path. The deployment was repeatedly re-annotated with the same digest on every run. No spurious restarts (first-observation skips restart), but the script could never transition to the "unchanged; no-op" path, and a real image update would have been indistinguishable from first observation.

Fix

Switch get_annotation to kubectl get -o json | jq -r --arg k \"\$_key\" '.metadata.annotations[\$k] // empty'. alpine/k8s ships jq so no new dependency. jq's behaviour is locked against the kubectl-go jsonpath flakiness around dotted keys.

Regression tests

  • Positive: script must include jq -r --arg k
  • Negative: script must NOT include annotations['\$_key'] bracket-notation jsonpath

136/136 unit tests pass.

Test plan

  • Confirmed annotation persistence via direct kubectl get on dev cluster
  • Re-upgrade dev release, re-run image-refresh CronJob, verify second tick logs "digest unchanged since last refresh; no-op"

🤖 Generated with Claude Code


Note

Medium Risk
Changes how the auto-refresh CronJob decides no-op vs rollout by fixing annotation reads; narrow script change with tests, but wrong reads previously blocked correct drift detection.

Overview
Fixes the image-refresh CronJob so it can read deployment annotations whose keys contain dots (e.g. tracebloc.io/last-refreshed-*-digest). get_annotation now uses kubectl get … -o json piped to jq instead of kubectl jsonpath with annotations['$_key'], which returned empty on alpine/k8s 1.30.x for those keys.

Without this, every tick treated the stored digest as unset, stayed on the first-observation path (re-annotate, no restart), and never reached digest unchanged; no-op—so a real registry update would not be compared correctly against the last recorded value.

Helm unit tests assert the script includes jq -r --arg k and must not regress to bracket-notation jsonpath for annotation reads.

Reviewed by Cursor Bugbot for commit 36134d7. Bugbot is set up for automated code reviews on this repo. Configure here.

…tion

Caught in dev-cluster smoke test on tb-client-dev-templates.

`kubectl get -o jsonpath="{.metadata.annotations['key.with.dots']}"`
returns empty for keys containing `.` or `/` on kubectl-go 1.30.x —
verified directly: `kubectl get ... -o jsonpath='{.metadata.annotations}'`
showed `tracebloc.io/last-refreshed-jobs-manager-digest:sha256:f913...`
present and persisted, but the script's bracket-notation read returned
empty. Effect: every tick saw `recorded=<unset>` and re-entered the
first-observation path, re-annotating the deployment with the same
digest on every run. No spurious restarts (first-observation skips
restart), but the script was incapable of ever transitioning to the
"unchanged; no-op" path, and a real image update would have been
indistinguishable from first observation.

Fix: read via `kubectl get -o json | jq -r --arg k "$_key"
'.metadata.annotations[$k] // empty'`. alpine/k8s ships jq, so this
adds no new dependency. The dot-escape jsonpath form
(`.tracebloc\.io/...`) also works but is fragile against future
kubectl version changes; jq's behaviour is locked.

Regression tests:
* The script must include `jq -r --arg k` (positive match).
* The script must NOT include `annotations['$_key']` bracket-notation
  jsonpath (negative match — the regression vector itself).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@saadqbal saadqbal self-assigned this May 25, 2026
@LukasWodka
Copy link
Copy Markdown
Contributor

👋 Heads-up — Code review queue is at 17 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@waqaskhanroghani waqaskhanroghani self-requested a review May 25, 2026 09:04
@saadqbal saadqbal merged commit cb151f7 into develop May 25, 2026
4 checks passed
@saadqbal saadqbal deleted the fix/154-annotation-read-via-jq branch May 25, 2026 09:09
@saadqbal
Copy link
Copy Markdown
Contributor Author

/fr-pass

Dev smoke-tested as part of the same session — second-tick now correctly reads back the annotation written on the first tick (recorded=sha256:f913…) and reports digest unchanged since last refresh; no-op. This was the exact regression motivating #156. See PR #157 description for the full verification table.

saadqbal added a commit that referenced this pull request May 25, 2026
…159)

* feat(#158): auto-refresh ingestor image digest without chart release

Extends the existing image-refresh CronJob (#155) to also reconcile
the ghcr.io/tracebloc/ingestor digest onto the live jobs-manager
deployment's INGESTOR_IMAGE_DIGEST env var. New ingestor image
publishes to the floating tag are now picked up within the cronjob's
poll interval (~15 min) instead of requiring a full chart release.

Why

Today, shipping a new ghcr.io/tracebloc/ingestor image required
bumping client/values.yaml images.ingestor.digest + client/Chart.yaml
+ PR + sync to main + release tag. That's hours of overhead per
bump and asymmetric with jobs-manager (which already gets the
~15-min image-refresh path). The asymmetry hurts because the
ingestor changes frequently as the data-ingestors team iterates.

Design

Two image classes in one CronJob now:

  Class 1 (jobs-manager, pods-monitor):
    Registry: docker.io
    Tag: CLIENT_ENV
    Source of truth: deployment annotation
      `tracebloc.io/last-refreshed-<image>-digest` (#154)
    Action on change: kubectl rollout restart

  Class 2 (ingestor):
    Registry: ghcr.io
    Tag: images.ingestor.tag (default "prod")
    Source of truth: live INGESTOR_IMAGE_DIGEST env value on the
      api container of the jobs-manager deployment (no annotation
      needed — the env IS the digest jobs-manager passes to each
      spawned ingestion Job, so the most direct read of "what
      will be used next" is THIS value).
    Action on change: kubectl set env (triggers natural rollout
      via ReplicaSet rotation — no explicit `rollout restart`).

get_token + get_latest_digest parameterized by registry; both
docker.io and ghcr.io support anonymous pull tokens for public
images with only the issuer URL differing.

Per-image opt-out

* jobs-manager / pods-monitor: same as #154 — set
  `images.<image>.digest` non-empty.
* ingestor: explicit `images.ingestor.autoRefresh: false` flag.
  Asymmetric because ingestor.digest must be non-empty for
  jobs-manager to function (an empty env would 503 every ingestion
  submit), so we can't use digest-presence as the signal.

When ALL THREE pin signals are active, the chart renders no
image-refresh resources at all (helper `imageRefreshEnabled`).
When at least one is unpinned, the cronjob is rendered and the
script skips pinned images via env flags at runtime.

Chart-default ingestor digest stays pinned (v0.3.0) as the
baseline for greenfield installs; image-refresh dynamically
updates the live env from there. Helm's 3-way merge preserves
image-refresh's writes across future helm upgrades as long as
the chart's pinned baseline doesn't change.

Subtle gotcha caught in dev

`default true $autoRefresh` in Go templates returns `true` even
when $autoRefresh is explicitly `false` (Go treats bool false as
falsy, so default overrides it). Switched to `eq $autoRefresh
false` directly — absence (nil) and explicit `true` both fall
through to "not pinned" as intended. Test pinned against the
correct idiom.

Other changes

* `log()` continues to write to stderr (#155 fix).
* `get_container_env` helper for jq-based env-var reads —
  same kubectl-jsonpath caveat as `get_annotation` (#156).
* Chart version bumped 1.4.0 → 1.4.1.

Tests

20 image-refresh-suite tests (was 17), 140 total pass. New
assertions:
  * all-three-pinned renders zero resources
  * only-jobs-manager+pods-monitor-pinned still renders (regression
    guard for the asymmetric pin signal — without this, the
    ingestor would never auto-refresh on default installs)
  * INGESTOR_PINNED flips correctly on autoRefresh=false
  * INGESTOR_TAG is overridable, `latest` rejected by schema
  * Script must include `kubectl set env`, `ghcr.io`,
    `auth.docker.io`, `get_container_env`, the empty-env
    fill-from-registry path, and the autoRefresh-skip log line

Closes #158

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): annotation as source of truth for ingestor (rollout-failure retry)

Caught in PR #159 review (bugbot, medium severity).

The original design used the live spec env (`INGESTOR_IMAGE_DIGEST` on
the api container) as the source of truth for "what image-refresh has
reconciled to." `kubectl set env` commits the new spec to etcd BEFORE
`kubectl rollout status` waits for the rollout to complete. If the
rollout times out or the new ReplicaSet's pods fail to come up:

  * `set -eu` aborts the script.
  * But the spec already matches the registry.
  * Next tick: `get_container_env` returns the new digest, compares
    equal to registry, no-op → script appears successful.
  * Meanwhile the old ReplicaSet's pods are still running with the
    OLD env, and the new ReplicaSet is stuck failing. The deployment
    is frozen on the old version with no retry signal.

Fix: mirror the jobs-manager/pods-monitor pattern from #154. Use a
`tracebloc.io/last-refreshed-ingestor-digest` annotation on the
deployment as the source of truth. Update the annotation as the LAST
step, only after `rollout status` succeeds. A failed rollout aborts
before the annotate → next tick sees stale annotation → retries.

First-observation contract for ingestor:
  * Non-empty spec env (the normal case — chart populates a default):
    adopt as baseline annotation, don't touch env. Same "don't churn
    on install" principle as jobs-manager first-observation.
  * Empty spec env (corrupted state, manual kubectl edit, stale
    --reuse-values): fill from registry on first tick. Empty would
    otherwise cause jobs-manager to 503 on every ingestion submit,
    so the "don't churn" trade is wrong in that case.

Tests pin:
  * Annotation key `tracebloc.io/last-refreshed-ingestor-digest`
    appears in the script.
  * Order-of-operations: set env → rollout status → annotate (the
    annotate MUST come last; regex matches the full sequence).

140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(#158): correct stale top-of-script comment + add first-obs test

Found during PR #159 self-review.

The top-of-script comment block still described the pre-bugbot-fix
design — claimed "Source of truth: the live env value itself (no
annotation needed)" and "no 'first observation' empty-state, each
tick is a normal compare-and-patch-if-different." Both were stale
after e7cf829 switched ingestor to annotation-based source of truth
and added the first-observation branch. Anyone reading the
script-level overview would have been misled about the actual loop.

Comment now matches the code: annotation as source of truth, two-
case first-observation contract (non-empty → adopt as baseline;
empty → fill from registry).

Also adds a positive regression test for the previously-untested
first-observation "adopting spec env as baseline" branch. The
empty-spec-env branch was already covered indirectly by the existing
"would 503 on ingestion submit" regex.

140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): also reconcile env drift (rollout undo / kubectl edit / GitOps)

Caught in PR #159 review (bugbot, medium severity). Follow-up to the
annotation-source-of-truth switch in e7cf829.

The previous no-op branch fired whenever annotation == registry and
NEVER read the live spec env. That meant any external actor that
reverted the spec env without touching the annotation would leave the
deployment on a stale env indefinitely. The annotation continued to
match the registry, so image-refresh kept skipping. Real scenarios
this affects:

  * `kubectl rollout undo deployment/X` — reverts pod template to a
    previous ReplicaSet's spec, including its INGESTOR_IMAGE_DIGEST
    env. Annotation on deployment metadata is untouched.
  * `kubectl edit deployment X` — operator manually changes the env.
  * Certain `helm upgrade` flag combos can reset env to the chart's
    pre-image-refresh baseline while preserving annotations (e.g.,
    --reset-values or upgrade from a chart where the digest baseline
    differs from what image-refresh had reconciled to).
  * GitOps reconcilers (Argo CD, Flux) that own the deployment spec
    will revert image-refresh's env writes back to the rendered
    template values.

In all of these, the live deployment runs a stale ingestor image
forever — exactly the failure mode #158 was meant to prevent.

Fix: each tick now reads both the annotation AND the live spec env.
Three reconciliation paths:

  * recorded != registry → "registry drift". Set env to registry,
    wait for rollout, update annotation. (Existing behaviour.)
  * recorded == registry AND spec env != recorded → "env drift". Set
    env to recorded value (NOT registry — registry matches recorded
    by definition here, but recorded is the value we last decided to
    roll to). Wait for rollout. Don't update the annotation; it's
    already correct.
  * recorded == registry AND spec env == recorded → fully in sync,
    no-op.

Tests pin:
  * The "spec env drifted" log line.
  * The drift-recovery branch sets env to `${recorded_ingestor}`,
    not `${latest_ingestor}` (different from the registry-drift
    branch). Regex catches the variable used in `INGESTOR_IMAGE_DIGEST=`.

Top-of-script comment block updated to document the drift recovery.
140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): wait for rollout status in adopt-as-baseline branch

Caught in PR #159 review (bugbot, high severity).

Scenario the existing code mishandles:
  1. Tick N: empty-spec-first-obs branch runs `kubectl set env`
     (commits new spec to etcd) → `kubectl rollout status` times out
     → `set -eu` aborts before the annotate. Annotation stays empty.
  2. Tick N+1: annotation still empty. spec env is now non-empty
     (the failed-rollout's spec change persists). get_container_env
     returns that value, so the adopt-as-baseline branch fires.
  3. Adopt-as-baseline only annotates — it never checks rollout
     health. Annotation records "we're at D1" while running pods
     are still on the old/empty env from before tick N.

The deployment now appears reconciled (annotation == registry on
subsequent ticks) while actually being stuck on the wrong image.

Fix: call `kubectl rollout status` inside the adopt-as-baseline
branch before the annotate. On a healthy deployment it returns
near-instantly; on a stuck rollout from a previous failed
set-env it times out, `set -eu` aborts before the annotate, next
tick retries. No latency cost on the happy path.

Regression test pins the (?s)-multiline order:
adopting → rollout status → annotate.

140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
saadqbal added a commit that referenced this pull request May 26, 2026
* Merge pull request #161 from tracebloc/chore/pin-ingestor-v0.3.1-160

chore: pin client chart's ingestor digest to v0.3.1

* feat(#158): auto-refresh ingestor image digest without chart release (#159)

* feat(#158): auto-refresh ingestor image digest without chart release

Extends the existing image-refresh CronJob (#155) to also reconcile
the ghcr.io/tracebloc/ingestor digest onto the live jobs-manager
deployment's INGESTOR_IMAGE_DIGEST env var. New ingestor image
publishes to the floating tag are now picked up within the cronjob's
poll interval (~15 min) instead of requiring a full chart release.

Why

Today, shipping a new ghcr.io/tracebloc/ingestor image required
bumping client/values.yaml images.ingestor.digest + client/Chart.yaml
+ PR + sync to main + release tag. That's hours of overhead per
bump and asymmetric with jobs-manager (which already gets the
~15-min image-refresh path). The asymmetry hurts because the
ingestor changes frequently as the data-ingestors team iterates.

Design

Two image classes in one CronJob now:

  Class 1 (jobs-manager, pods-monitor):
    Registry: docker.io
    Tag: CLIENT_ENV
    Source of truth: deployment annotation
      `tracebloc.io/last-refreshed-<image>-digest` (#154)
    Action on change: kubectl rollout restart

  Class 2 (ingestor):
    Registry: ghcr.io
    Tag: images.ingestor.tag (default "prod")
    Source of truth: live INGESTOR_IMAGE_DIGEST env value on the
      api container of the jobs-manager deployment (no annotation
      needed — the env IS the digest jobs-manager passes to each
      spawned ingestion Job, so the most direct read of "what
      will be used next" is THIS value).
    Action on change: kubectl set env (triggers natural rollout
      via ReplicaSet rotation — no explicit `rollout restart`).

get_token + get_latest_digest parameterized by registry; both
docker.io and ghcr.io support anonymous pull tokens for public
images with only the issuer URL differing.

Per-image opt-out

* jobs-manager / pods-monitor: same as #154 — set
  `images.<image>.digest` non-empty.
* ingestor: explicit `images.ingestor.autoRefresh: false` flag.
  Asymmetric because ingestor.digest must be non-empty for
  jobs-manager to function (an empty env would 503 every ingestion
  submit), so we can't use digest-presence as the signal.

When ALL THREE pin signals are active, the chart renders no
image-refresh resources at all (helper `imageRefreshEnabled`).
When at least one is unpinned, the cronjob is rendered and the
script skips pinned images via env flags at runtime.

Chart-default ingestor digest stays pinned (v0.3.0) as the
baseline for greenfield installs; image-refresh dynamically
updates the live env from there. Helm's 3-way merge preserves
image-refresh's writes across future helm upgrades as long as
the chart's pinned baseline doesn't change.

Subtle gotcha caught in dev

`default true $autoRefresh` in Go templates returns `true` even
when $autoRefresh is explicitly `false` (Go treats bool false as
falsy, so default overrides it). Switched to `eq $autoRefresh
false` directly — absence (nil) and explicit `true` both fall
through to "not pinned" as intended. Test pinned against the
correct idiom.

Other changes

* `log()` continues to write to stderr (#155 fix).
* `get_container_env` helper for jq-based env-var reads —
  same kubectl-jsonpath caveat as `get_annotation` (#156).
* Chart version bumped 1.4.0 → 1.4.1.

Tests

20 image-refresh-suite tests (was 17), 140 total pass. New
assertions:
  * all-three-pinned renders zero resources
  * only-jobs-manager+pods-monitor-pinned still renders (regression
    guard for the asymmetric pin signal — without this, the
    ingestor would never auto-refresh on default installs)
  * INGESTOR_PINNED flips correctly on autoRefresh=false
  * INGESTOR_TAG is overridable, `latest` rejected by schema
  * Script must include `kubectl set env`, `ghcr.io`,
    `auth.docker.io`, `get_container_env`, the empty-env
    fill-from-registry path, and the autoRefresh-skip log line

Closes #158

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): annotation as source of truth for ingestor (rollout-failure retry)

Caught in PR #159 review (bugbot, medium severity).

The original design used the live spec env (`INGESTOR_IMAGE_DIGEST` on
the api container) as the source of truth for "what image-refresh has
reconciled to." `kubectl set env` commits the new spec to etcd BEFORE
`kubectl rollout status` waits for the rollout to complete. If the
rollout times out or the new ReplicaSet's pods fail to come up:

  * `set -eu` aborts the script.
  * But the spec already matches the registry.
  * Next tick: `get_container_env` returns the new digest, compares
    equal to registry, no-op → script appears successful.
  * Meanwhile the old ReplicaSet's pods are still running with the
    OLD env, and the new ReplicaSet is stuck failing. The deployment
    is frozen on the old version with no retry signal.

Fix: mirror the jobs-manager/pods-monitor pattern from #154. Use a
`tracebloc.io/last-refreshed-ingestor-digest` annotation on the
deployment as the source of truth. Update the annotation as the LAST
step, only after `rollout status` succeeds. A failed rollout aborts
before the annotate → next tick sees stale annotation → retries.

First-observation contract for ingestor:
  * Non-empty spec env (the normal case — chart populates a default):
    adopt as baseline annotation, don't touch env. Same "don't churn
    on install" principle as jobs-manager first-observation.
  * Empty spec env (corrupted state, manual kubectl edit, stale
    --reuse-values): fill from registry on first tick. Empty would
    otherwise cause jobs-manager to 503 on every ingestion submit,
    so the "don't churn" trade is wrong in that case.

Tests pin:
  * Annotation key `tracebloc.io/last-refreshed-ingestor-digest`
    appears in the script.
  * Order-of-operations: set env → rollout status → annotate (the
    annotate MUST come last; regex matches the full sequence).

140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(#158): correct stale top-of-script comment + add first-obs test

Found during PR #159 self-review.

The top-of-script comment block still described the pre-bugbot-fix
design — claimed "Source of truth: the live env value itself (no
annotation needed)" and "no 'first observation' empty-state, each
tick is a normal compare-and-patch-if-different." Both were stale
after e7cf829 switched ingestor to annotation-based source of truth
and added the first-observation branch. Anyone reading the
script-level overview would have been misled about the actual loop.

Comment now matches the code: annotation as source of truth, two-
case first-observation contract (non-empty → adopt as baseline;
empty → fill from registry).

Also adds a positive regression test for the previously-untested
first-observation "adopting spec env as baseline" branch. The
empty-spec-env branch was already covered indirectly by the existing
"would 503 on ingestion submit" regex.

140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): also reconcile env drift (rollout undo / kubectl edit / GitOps)

Caught in PR #159 review (bugbot, medium severity). Follow-up to the
annotation-source-of-truth switch in e7cf829.

The previous no-op branch fired whenever annotation == registry and
NEVER read the live spec env. That meant any external actor that
reverted the spec env without touching the annotation would leave the
deployment on a stale env indefinitely. The annotation continued to
match the registry, so image-refresh kept skipping. Real scenarios
this affects:

  * `kubectl rollout undo deployment/X` — reverts pod template to a
    previous ReplicaSet's spec, including its INGESTOR_IMAGE_DIGEST
    env. Annotation on deployment metadata is untouched.
  * `kubectl edit deployment X` — operator manually changes the env.
  * Certain `helm upgrade` flag combos can reset env to the chart's
    pre-image-refresh baseline while preserving annotations (e.g.,
    --reset-values or upgrade from a chart where the digest baseline
    differs from what image-refresh had reconciled to).
  * GitOps reconcilers (Argo CD, Flux) that own the deployment spec
    will revert image-refresh's env writes back to the rendered
    template values.

In all of these, the live deployment runs a stale ingestor image
forever — exactly the failure mode #158 was meant to prevent.

Fix: each tick now reads both the annotation AND the live spec env.
Three reconciliation paths:

  * recorded != registry → "registry drift". Set env to registry,
    wait for rollout, update annotation. (Existing behaviour.)
  * recorded == registry AND spec env != recorded → "env drift". Set
    env to recorded value (NOT registry — registry matches recorded
    by definition here, but recorded is the value we last decided to
    roll to). Wait for rollout. Don't update the annotation; it's
    already correct.
  * recorded == registry AND spec env == recorded → fully in sync,
    no-op.

Tests pin:
  * The "spec env drifted" log line.
  * The drift-recovery branch sets env to `${recorded_ingestor}`,
    not `${latest_ingestor}` (different from the registry-drift
    branch). Regex catches the variable used in `INGESTOR_IMAGE_DIGEST=`.

Top-of-script comment block updated to document the drift recovery.
140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): wait for rollout status in adopt-as-baseline branch

Caught in PR #159 review (bugbot, high severity).

Scenario the existing code mishandles:
  1. Tick N: empty-spec-first-obs branch runs `kubectl set env`
     (commits new spec to etcd) → `kubectl rollout status` times out
     → `set -eu` aborts before the annotate. Annotation stays empty.
  2. Tick N+1: annotation still empty. spec env is now non-empty
     (the failed-rollout's spec change persists). get_container_env
     returns that value, so the adopt-as-baseline branch fires.
  3. Adopt-as-baseline only annotates — it never checks rollout
     health. Annotation records "we're at D1" while running pods
     are still on the old/empty env from before tick N.

The deployment now appears reconciled (annotation == registry on
subsequent ticks) while actually being stuck on the wrong image.

Fix: call `kubectl rollout status` inside the adopt-as-baseline
branch before the annotate. On a healthy deployment it returns
near-instantly; on a stuck rollout from a previous failed
set-env it times out, `set -eu` aborts before the annotate, next
tick retries. No latency cost on the happy path.

Regression test pins the (?s)-multiline order:
adopting → rollout status → annotate.

140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): two bugbot follow-ups + chart default tag

Caught in PR #162 review (bugbot, two medium-severity issues).

1. Env-drift rollout retry gap

   The no-op branch (annotation == registry AND spec env == recorded)
   was a bare log statement with no rollout-health verification. A
   previous tick's env-drift `kubectl set env` commits its spec change
   to etcd BEFORE `kubectl rollout status` waits for the new
   ReplicaSet to come up. If the rollout fails, `set -eu` aborts —
   but the spec write persists. Next tick: annotation, registry, and
   spec env all match (because the spec write committed), so the
   no-op branch fires and silently masks the stuck rollout. Running
   pods may be on the old or empty INGESTOR_IMAGE_DIGEST while the
   script reports success.

   Fix: call `kubectl rollout status` in the no-op branch too. On a
   healthy deployment it returns near-instantly (no active rollout
   to wait for). On a stuck deployment it times out, set -eu aborts,
   and the Job is visibly failed in `kubectl get cronjob`. The
   operator then sees the stuck state and can investigate. Image-
   refresh can't autonomously recover from a bad image push, but
   making the failure visible is the right behaviour.

2. Default ingestor tag mismatched team's publishing convention

   Chart defaulted `images.ingestor.tag: prod`. The team's
   ghcr.io/tracebloc/ingestor repo uses semver-style float tags
   (`0`, `0.3`) — there is no `prod` tag. Default install would
   silently no-op every tick because manifest resolution 404'd:

     curl ... ghcr.io/v2/.../manifests/prod → 404
     log "  WARN: could not resolve latest digest; skipping"

   The whole ingestor auto-refresh feature wouldn't work for any
   customer running the chart's defaults, despite `autoRefresh:
   true`.

   Fix: changed default to "0.3" (conservative — patch-only auto-
   track; won't pick up a future 0.4 with breaking changes).
   Operators can override to "0" if they want major-version
   auto-tracking. Long-term, the team should consider standardising
   the chart default once the data-ingestors release-image.yml
   formalises its tag-publishing contract — for now this matches
   what we tested with on the dev cluster.

Regression tests:
  * Default tag asserted as "0.3" with `notContains` of "prod" to
    guard against silent revert.
  * No-op branch asserted to call `kubectl rollout status` via
    (?s)-multiline regex matching the "verifying deployment health"
    log line + the kubectl rollout status call.
  * Existing test updated from value: prod to value: "0.3".

141/141 unit tests pass.

NB: these commits are landing on the sync branch directly to avoid
another full develop-PR cycle before release. After #162 merges,
the same content will need to flow back to develop — either via a
"sync main → develop" PR or by cherry-picking the two commits. The
divergence is two commits and is easy to resolve.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): align INGESTOR_TAG runtime fallback with chart default

Caught in PR #162 review (bugbot, medium severity).

The chart default was changed from "prod" to "0.3" in the values
block — matches the team's ghcr.io publishing convention — but the
CronJob template's runtime fallback was left at `| default "prod"`.
Two render paths:

  * helm install / helm upgrade --reset-then-reuse-values: the
    chart's new default ("0.3") flows through, runtime fallback
    never fires, INGESTOR_TAG="0.3". OK.
  * helm upgrade --reuse-values from a pre-v1.4.1 stored manifest:
    the stored values lack `images.ingestor.tag` entirely. Runtime
    fallback fires, renders INGESTOR_TAG="prod", which 404s on
    ghcr.io because that tag doesn't exist. Ingestor refresh
    silently no-ops every tick.

Failure mode is graceful (log warning, no crash), but inconsistent
with the per-customer expectation that v1.4.1 enables ingestor
auto-refresh. autoUpgrade itself uses --reset-then-reuse-values, so
this only hits manual --reuse-values upgrades — narrow but real.

Fix: change runtime fallback to "0.3" so both render paths converge.

Regression test simulates the --reuse-values scenario by setting
images.ingestor.tag=null, exercising the runtime fallback. 142/142
tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants