Skip to content

chore: bump chart 1.4.4 → 1.4.5 (ships #197 training-egress proxy fix)#198

Merged
saadqbal merged 1 commit into
developfrom
chore/bump-chart-1.4.5
Jun 4, 2026
Merged

chore: bump chart 1.4.4 → 1.4.5 (ships #197 training-egress proxy fix)#198
saadqbal merged 1 commit into
developfrom
chore/bump-chart-1.4.5

Conversation

@saadqbal
Copy link
Copy Markdown
Contributor

@saadqbal saadqbal commented Jun 4, 2026

Why

1.4.4 is already published on the tracebloc.github.io/client Pages channel and is what clusters run. The training-egress NetworkPolicy fix (#197 — allow training pods → requests-proxy:8888) merged to develop without a version bump, so it's currently undeliverable:

  • chart-releaser won't overwrite the existing 1.4.4 release, and
  • clusters already on 1.4.4 see no version change → auto-upgrade pulls nothing.

What

Bump Chart.yaml 1.4.4 → 1.4.5 (lockstep version/appVersion, matching the 1.4.3/1.4.4 history). Chart-only change; no image change.

After this merges

Cut a v1.4.5 GitHub Release → the release-helm-chart workflow packages + publishes it → every client's auto-upgrade pulls it at :23 and self-heals onto the chart-managed NetworkPolicy rule. (Fleet-wide — release-owner's call.) Then re-enable the interim-patched cluster's suspended auto-upgrade CronJob.

Ref #196 / #197.

🤖 Generated with Claude Code


Note

Low Risk
Single metadata bump in Chart.yaml with no runtime or template changes in this PR.

Overview
Version-only release so the already-merged training-egress NetworkPolicy fix (#197) can ship via Helm: Chart.yaml version and appVersion move in lockstep from 1.4.4 to 1.4.5. No template or image changes in this diff.

Because 1.4.4 is already on the chart repo, chart-releaser cannot republish that tag and clusters on 1.4.4 see no upgrade path for the proxy egress rule. A new chart version is required for auto-upgrade to pick up the chart-managed allow for training pods → requests-proxy:8888.

Reviewed by Cursor Bugbot for commit c7bbff1. Bugbot is set up for automated code reviews on this repo. Configure here.

1.4.4 is already published on the tracebloc.github.io/client Pages
channel and is what clusters run. The training-egress NetworkPolicy fix
(#197, allow training → requests-proxy:8888) merged to develop without a
version bump, so it is currently undeliverable: chart-releaser won't
overwrite the existing 1.4.4 release, and clusters already on 1.4.4 would
see no version change and pull nothing.

Bump to 1.4.5 (lockstep version/appVersion, matching 1.4.3/1.4.4 history)
so a v1.4.5 release publishes a new version that auto-upgrade actually
pulls. Chart-only change; no image change.

Ref #196 / #197.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@LukasWodka
Copy link
Copy Markdown
Contributor

👋 Heads-up — Code review queue is at 16 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@saadqbal saadqbal self-assigned this Jun 4, 2026
@saadqbal saadqbal merged commit ddb233a into develop Jun 4, 2026
15 checks passed
saadqbal added a commit that referenced this pull request Jun 4, 2026
* fix(resource-monitor): always grant read-only ClusterRole (decouple from clusterScope)

Under clusterScope: false the chart rendered only a namespace-scoped Role in
the release namespace. But the resource-monitor's code:
  * calls core_v1_api.list_pod_for_all_namespaces(field_selector=spec.nodeName=...)
    -- a CLUSTER-SCOPED list verb a namespaced Role can never satisfy; and
  * read_namespaced_pod()s its OWN pod, which lives in
    .Values.nodeAgents.namespace.name (NOT .Release.Namespace).

So with clusterScope: false the DaemonSet 403'd on startup and crashlooped
(70+ restarts observed on a live cluster). Per-node monitoring is intrinsically
cluster-scoped.

Always render the read-only ClusterRole + ClusterRoleBinding regardless of
clusterScope (get/list/watch on pods/nodes/namespaces + metrics; no write,
exec, or secret access). resourceMonitor: false still fully disables the
component. clusterScope continues to gate the training/jobs isolation footprint
elsewhere -- it must not leave the node monitor without permissions it cannot
run without.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(resource-monitor): assert always-cluster-scoped RBAC under clusterScope=false

Follow-up to the RBAC fix: node_agents_namespace_test.yaml still asserted the
old behavior (namespaced Role + RoleBinding in the release namespace when
clusterScope=false). Update that case to assert the corrected contract -- a
ClusterRole + ClusterRoleBinding always render (with no metadata.namespace),
while the subject SA still lives in the node-agents namespace.

The clusterScope=false path stays under test; only the asserted behavior
changes to match the fix. Verified with `helm unittest` (all resource-monitor
suites pass).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(rbac): grant `get` on configmaps/secrets to jobs-manager SA

The ingestion endpoint's orphan-resource verify path (client-runtime#52)
and missing-row self-heal (client-runtime#54) read the existing
ConfigMap/Secret on a create-409 to confirm content matches before
reuse. The Role/ClusterRole only granted `create`, so those reads
returned Forbidden and the endpoint 500'd instead of the intended
409/200-replay — verified live on the dev cluster.

Add `get` alongside `create` in both the ClusterRole (clusterScope:
true) and namespace Role (clusterScope: false) branches.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci(helm): guard that the pinned ingestor digest is multi-arch (closes #186) (#187)

Add a helm-ci job (ingestor-multiarch) that parses images.ingestor.digest and
fails the build unless it's a multi-arch index (linux/amd64 + linux/arm64).

Greenfield installs spawn the ingestor Job from this PINNED digest before
image-refresh first ticks, so an amd64-only pin breaks data ingestion on arm64
hosts (Apple Silicon, Graviton) with "no match for platform" / ImagePullBackOff.
This would have caught #160 (the amd64-only v0.3.1 pin) before it shipped.

ghcr.io/tracebloc/ingestor is public -> no secrets. Verified: passes on the
current multi-arch baseline (sha256:d361fa77, v0.3.2 / #184), fails on the old
amd64-only sha256:a0861ea9.

Note: the digest is already multi-arch on develop as of v0.3.2 (#184 — the same
d361fa77 index this PR previously bumped to), so #187 no longer touches
values.yaml; it adds only the regression guard so an amd64-only pin can't slip
back in.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* fix(#190): fail image-refresh loudly when the ingestor (ghcr) digest can't resolve (#191)

image-refresh silently skipped every tick when get_latest_digest returned
empty for the ghcr.io ingestor image (egress/proxy/firewall to ghcr.io, or a
blocked token endpoint) — never reaching the registry-drift branch that sets
the new digest. jobs-manager + pods-monitor pull from docker.io and refreshed
fine, so the CronJob looked healthy while the ingestor digest stayed pinned on
the install-time baseline. That's why the berlin-team arm64 install sat on the
amd64-only v0.3.1 digest even after :0.3 went multi-arch (#186 follow-up #2).

Now count consecutive ingestor-resolve failures on a deployment annotation:
- below imageRefresh.ingestorResolveFailureThreshold (default 3, ~45 min at the
  15-min schedule) -> WARN + skip, as before (tolerate transient blips);
- at/above it -> ERROR with actionable guidance, a
  tracebloc.io/ingestor-refresh-last-error annotation, and a non-zero exit so
  the Job fails visibly in `kubectl get cronjob` / monitoring — the same
  surfacing idiom Pass 2's stuck-rollout check already relies on;
- a successful resolve clears the streak.

Threshold is nil-guarded (default 3) for --reuse-values upgrades and
schema-validated (integer >= 1). The digest-resolution logic itself is
unchanged (verified correct: it returns the multi-arch index digest).

helm unittest 146/146, helm lint clean, shellcheck + sh -n clean.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* test(requests-proxy): add helm-unittest coverage for requests-proxy Deployment (#194)

requests-proxy-deployment.yaml was the only data-plane workload template
without a unit test. This suite pins the properties most costly to regress:

- security-context invariants (no SA-token automount, runAsNonRoot,
  seccomp RuntimeDefault, runAsUser 1001, no privilege escalation,
  drop ALL caps, read-only root filesystem) — see docs/SECURITY.md
- the single-replica / single gunicorn worker constraint (the pod token
  registry is process-local; >1 worker silently shards token lookups)
- the docker.io/tracebloc/jobs-manager image source and port 8888
- the nil-guarded resource defaults, plus an override case that exercises
  the default-through-dict fallthrough (guards the historic
  `readOnlyRootFilesystem: trueresources:` newline-eating regression)

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* fix(#196): allow training-pod egress to the requests-proxy (8888) (#197)

The training-egress NetworkPolicy denies all pod-to-pod / ClusterIP
egress (rule 2 excepts the cluster CIDRs) and re-permits only MySQL
(rule 3). When the requests-proxy architecture shipped — training pods
POST epoch results / FLOPs to requests-proxy-service:8888 instead of
holding Service Bus credentials — this template was never updated to
re-permit egress to the proxy. Result on every install with the policy
enabled: pods hit "requests-proxy-service:8888 ... [Errno 111]
Connection refused" at the first epoch finalize → CrashLoopBackOff →
all experiments fail.

Add rule 4 mirroring the MySQL rule: TCP/8888 to podSelector
app=requests-proxy (same namespace). Service selector + port from
templates/requests-proxy-service.yaml.

Verified: `helm template -f ci/bm-values.yaml --show-only
templates/network-policy-training.yaml` renders the new rule as valid
YAML.

Found live on a fresh client (tracebloc-amazon / k3d): jobs-manager
reached the proxy (HTTP 401) while training pods got connection-refused
— the only differentiator was this egress policy. Interim: live-patched
the cluster + suspended its auto-upgrade CronJob (so reuse-values
wouldn't revert the patch); re-enable once this lands + releases.

Closes #196.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Installer UX: drop PriorityClass, fix namespace, one-per-machine guard, surface version (#192)

* fix(chart): drop the data-plane PriorityClass by default

The cluster-scoped, fixed-name `tracebloc-data-plane` PriorityClass was the
only thing forcing one tracebloc client per cluster (a second release collided
on it with a cryptic Helm error) and blocking multiple tracebloc namespaces in
one BYO cluster. mysql doesn't need it: memory requests==limits (last evicted
under memory pressure), data on a PVC (eviction = transient restart, not data
loss), and a PDB guards voluntary disruptions. Its only unique benefit was
letting the scheduler preempt training jobs to keep mysql scheduled on a packed
node — a narrow case.

Default priorityClass.create=false + name="" so new installs template no
PriorityClass and mysql carries no priorityClassName. Opt back in
(create:true + name) on contended clusters, or reference an out-of-band one
(create:false + name:<existing>). helm-unittest updated; 144/144 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer): fixed namespace + one-per-machine guard; drop workspace prompt

The "Choose a workspace name" prompt asked the user to invent a label that
isn't their identity (the backend identifies a client by its credentials, not
this string — it's just the local k8s namespace / Helm release name; the
installer even discards the auth response body). It defaulted to a meaningless
"default" and was the field that collided on a second install.

- Drop the prompt; TB_NAMESPACE defaults to a fixed "tracebloc"
  (env-overridable for advanced/GitOps setups).
- One-client-per-machine guard: after credentials verify, compare the entered
  Client ID against any client already installed here (helm get values). Same
  ID = a normal re-run/upgrade; a DIFFERENT ID hard-blocks with an explanation
  and options (repair / switch via `k3d cluster delete` / use another machine)
  instead of silently re-pointing the machine. This replaces the accidental
  PriorityClass collision (now dropped) with an intentional, explained guard.
- Document the TB_NAMESPACE override; update bats (input sequences + 2 new
  guard tests). bats 26/27 — the 1 failure is a pre-existing macOS-bash-3.2
  quirk in _extract_yaml_value, unrelated (CI bash 5 passes it).

NOTE: install-k8s.ps1 + its Pester tests still need the same mirror (follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer-ps): mirror fixed namespace + one-per-machine guard (PowerShell)

Mirrors the bash change in install-k8s.ps1:
- drop the "Choose a workspace name" prompt; TB_NAMESPACE defaults to a fixed
  "tracebloc" (override via $env:TB_NAMESPACE).
- one-client-per-machine guard: after credentials verify, compare the entered
  Client ID against any client already installed here (helm get values); a
  different ID hard-blocks with the same explanation/options as bash.
- Pester: 2 new guard tests (block-different / allow-same). The existing
  Install-ClientHelm tests use dispatch-by-prompt Read-Host mocks, so the
  prompt removal doesn't disturb them.

No pwsh locally -> verified via CI (Pester ubuntu+windows + PSScriptAnalyzer).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): scan all namespaces in the one-per-machine guard

The guard checked only the `tracebloc` namespace, so a client installed by an
older installer version (default namespace `default`, or a custom name) wasn't
detected -- a re-run could create a second coexisting client. Now enumerate all
client-chart releases (helm list -A) and compare each one's clientId, covering
both fresh and migrated installs. bash uses jq (already a dependency; falls
back to the tracebloc namespace if absent); PowerShell uses ConvertFrom-Json.
The block message names the namespace. bats + Pester guard tests updated.

Verified: bats green locally (jq path); PowerShell via CI Pester.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer): show client (chart) version in summary + --diagnose

Users had no easy way to see which client version they're on (the CLI isn't
shipped yet; `helm list` needs the namespace, and nothing surfaced it). Show
the chart version where they already look:
- install summary: a "Version" line next to Workspace.
- --diagnose: as the first console line + recorded in the bundle header
  (the #1 thing support needs).

Adds a best-effort `_chart_version` / Get-ChartVersion helper (greps helm's
CHART column -> no jq). bash + PowerShell; bats + Pester coverage added.

Verified: summary.bats + diagnose.bats green locally; ps1 via CI Pester.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* chore: bump chart 1.4.4 → 1.4.5 to ship the training-egress proxy fix (#198)

1.4.4 is already published on the tracebloc.github.io/client Pages
channel and is what clusters run. The training-egress NetworkPolicy fix
(#197, allow training → requests-proxy:8888) merged to develop without a
version bump, so it is currently undeliverable: chart-releaser won't
overwrite the existing 1.4.4 release, and clusters already on 1.4.4 would
see no version change and pull nothing.

Bump to 1.4.5 (lockstep version/appVersion, matching 1.4.3/1.4.4 history)
so a v1.4.5 release publishes a new version that auto-upgrade actually
pulls. Chart-only change; no image change.

Ref #196 / #197.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: shujaat hasan <shujaathasan@shujaats-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: shujaat_tracebloc <153823837+shujaatTracebloc@users.noreply.github.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants