Skip to content

test(requests-proxy): add helm-unittest coverage for requests-proxy Deployment#194

Merged
saadqbal merged 1 commit into
developfrom
tests/coverage-requests-proxy-20260604
Jun 4, 2026
Merged

test(requests-proxy): add helm-unittest coverage for requests-proxy Deployment#194
saadqbal merged 1 commit into
developfrom
tests/coverage-requests-proxy-20260604

Conversation

@saadqbal
Copy link
Copy Markdown
Contributor

@saadqbal saadqbal commented Jun 3, 2026

What

Adds a helm-unittest suite for templates/requests-proxy-deployment.yaml, the only data-plane workload template with no unit test. Part of #193.

Why

The requests-proxy Deployment carries the chart's security-context invariants and a single-worker constraint that are costly to regress silently. This suite pins them:

  • Security context (see docs/SECURITY.md): no SA-token automount, runAsNonRoot, seccomp RuntimeDefault, runAsUser: 1001, allowPrivilegeEscalation: false, capabilities.drop: [ALL], readOnlyRootFilesystem: true.
  • Single replica + single gunicorn worker — the pod token registry is process-local; >1 worker silently shards token lookups.
  • Image source docker.io/tracebloc/jobs-manager and container port 8888.
  • Nil-guarded resources — the default requests/limits render, plus an override case that exercises the default-through-dict fallthrough (guards the historic readOnlyRootFilesystem: trueresources: newline-eating regression).

Coverage delta

templates/requests-proxy-deployment.yaml: no suite → 11 assertions / 11 cases. Untested chart templates: 7 → 6.

Verification

$ helm unittest ./client
Test Suites: 15 passed, 15 total
Tests:       153 passed, 153 total

(14 suites / 142 tests before this change — no existing suites touched.)


This is the first PR from a test-coverage dry-run (the scheduled "daily coverage PR" routine couldn't be registered — claude.ai scheduler was unreachable). It previews what that automation will produce per run.

🤖 Generated with Claude Code


Note

Low Risk
Test-only change with no production template or application code modifications.

Overview
Adds a new helm-unittest suite (client/tests/requests_proxy_test.yaml) for templates/requests-proxy-deployment.yaml, which previously had no unit tests.

The suite locks in rendered behavior with 11 test cases: Deployment identity/labels, exactly one replica, no service account token automount, pod and container security hardening (non-root, seccomp, capabilities drop, read-only root FS), jobs-manager image on docker.io, port 8888, gunicorn --workers=1, default CPU/memory requests and limits, and a values override path through the nil-guarded resources.requestsProxy block (regression guard for the historic YAML newline bug).

No Helm template or runtime code is modified—test-only coverage for a data-plane workload.

Reviewed by Cursor Bugbot for commit a50417a. Bugbot is set up for automated code reviews on this repo. Configure here.

…eployment

requests-proxy-deployment.yaml was the only data-plane workload template
without a unit test. This suite pins the properties most costly to regress:

- security-context invariants (no SA-token automount, runAsNonRoot,
  seccomp RuntimeDefault, runAsUser 1001, no privilege escalation,
  drop ALL caps, read-only root filesystem) — see docs/SECURITY.md
- the single-replica / single gunicorn worker constraint (the pod token
  registry is process-local; >1 worker silently shards token lookups)
- the docker.io/tracebloc/jobs-manager image source and port 8888
- the nil-guarded resource defaults, plus an override case that exercises
  the default-through-dict fallthrough (guards the historic
  `readOnlyRootFilesystem: trueresources:` newline-eating regression)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@saadqbal saadqbal added the auto-coverage Automated test-coverage PRs label Jun 3, 2026
@saadqbal saadqbal self-assigned this Jun 3, 2026
@LukasWodka
Copy link
Copy Markdown
Contributor

👋 Heads-up — Code review queue is at 14 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@saadqbal saadqbal merged commit 1809872 into develop Jun 4, 2026
15 checks passed
saadqbal added a commit that referenced this pull request Jun 4, 2026
* fix(resource-monitor): always grant read-only ClusterRole (decouple from clusterScope)

Under clusterScope: false the chart rendered only a namespace-scoped Role in
the release namespace. But the resource-monitor's code:
  * calls core_v1_api.list_pod_for_all_namespaces(field_selector=spec.nodeName=...)
    -- a CLUSTER-SCOPED list verb a namespaced Role can never satisfy; and
  * read_namespaced_pod()s its OWN pod, which lives in
    .Values.nodeAgents.namespace.name (NOT .Release.Namespace).

So with clusterScope: false the DaemonSet 403'd on startup and crashlooped
(70+ restarts observed on a live cluster). Per-node monitoring is intrinsically
cluster-scoped.

Always render the read-only ClusterRole + ClusterRoleBinding regardless of
clusterScope (get/list/watch on pods/nodes/namespaces + metrics; no write,
exec, or secret access). resourceMonitor: false still fully disables the
component. clusterScope continues to gate the training/jobs isolation footprint
elsewhere -- it must not leave the node monitor without permissions it cannot
run without.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(resource-monitor): assert always-cluster-scoped RBAC under clusterScope=false

Follow-up to the RBAC fix: node_agents_namespace_test.yaml still asserted the
old behavior (namespaced Role + RoleBinding in the release namespace when
clusterScope=false). Update that case to assert the corrected contract -- a
ClusterRole + ClusterRoleBinding always render (with no metadata.namespace),
while the subject SA still lives in the node-agents namespace.

The clusterScope=false path stays under test; only the asserted behavior
changes to match the fix. Verified with `helm unittest` (all resource-monitor
suites pass).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(rbac): grant `get` on configmaps/secrets to jobs-manager SA

The ingestion endpoint's orphan-resource verify path (client-runtime#52)
and missing-row self-heal (client-runtime#54) read the existing
ConfigMap/Secret on a create-409 to confirm content matches before
reuse. The Role/ClusterRole only granted `create`, so those reads
returned Forbidden and the endpoint 500'd instead of the intended
409/200-replay — verified live on the dev cluster.

Add `get` alongside `create` in both the ClusterRole (clusterScope:
true) and namespace Role (clusterScope: false) branches.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci(helm): guard that the pinned ingestor digest is multi-arch (closes #186) (#187)

Add a helm-ci job (ingestor-multiarch) that parses images.ingestor.digest and
fails the build unless it's a multi-arch index (linux/amd64 + linux/arm64).

Greenfield installs spawn the ingestor Job from this PINNED digest before
image-refresh first ticks, so an amd64-only pin breaks data ingestion on arm64
hosts (Apple Silicon, Graviton) with "no match for platform" / ImagePullBackOff.
This would have caught #160 (the amd64-only v0.3.1 pin) before it shipped.

ghcr.io/tracebloc/ingestor is public -> no secrets. Verified: passes on the
current multi-arch baseline (sha256:d361fa77, v0.3.2 / #184), fails on the old
amd64-only sha256:a0861ea9.

Note: the digest is already multi-arch on develop as of v0.3.2 (#184 — the same
d361fa77 index this PR previously bumped to), so #187 no longer touches
values.yaml; it adds only the regression guard so an amd64-only pin can't slip
back in.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* fix(#190): fail image-refresh loudly when the ingestor (ghcr) digest can't resolve (#191)

image-refresh silently skipped every tick when get_latest_digest returned
empty for the ghcr.io ingestor image (egress/proxy/firewall to ghcr.io, or a
blocked token endpoint) — never reaching the registry-drift branch that sets
the new digest. jobs-manager + pods-monitor pull from docker.io and refreshed
fine, so the CronJob looked healthy while the ingestor digest stayed pinned on
the install-time baseline. That's why the berlin-team arm64 install sat on the
amd64-only v0.3.1 digest even after :0.3 went multi-arch (#186 follow-up #2).

Now count consecutive ingestor-resolve failures on a deployment annotation:
- below imageRefresh.ingestorResolveFailureThreshold (default 3, ~45 min at the
  15-min schedule) -> WARN + skip, as before (tolerate transient blips);
- at/above it -> ERROR with actionable guidance, a
  tracebloc.io/ingestor-refresh-last-error annotation, and a non-zero exit so
  the Job fails visibly in `kubectl get cronjob` / monitoring — the same
  surfacing idiom Pass 2's stuck-rollout check already relies on;
- a successful resolve clears the streak.

Threshold is nil-guarded (default 3) for --reuse-values upgrades and
schema-validated (integer >= 1). The digest-resolution logic itself is
unchanged (verified correct: it returns the multi-arch index digest).

helm unittest 146/146, helm lint clean, shellcheck + sh -n clean.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* test(requests-proxy): add helm-unittest coverage for requests-proxy Deployment (#194)

requests-proxy-deployment.yaml was the only data-plane workload template
without a unit test. This suite pins the properties most costly to regress:

- security-context invariants (no SA-token automount, runAsNonRoot,
  seccomp RuntimeDefault, runAsUser 1001, no privilege escalation,
  drop ALL caps, read-only root filesystem) — see docs/SECURITY.md
- the single-replica / single gunicorn worker constraint (the pod token
  registry is process-local; >1 worker silently shards token lookups)
- the docker.io/tracebloc/jobs-manager image source and port 8888
- the nil-guarded resource defaults, plus an override case that exercises
  the default-through-dict fallthrough (guards the historic
  `readOnlyRootFilesystem: trueresources:` newline-eating regression)

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* fix(#196): allow training-pod egress to the requests-proxy (8888) (#197)

The training-egress NetworkPolicy denies all pod-to-pod / ClusterIP
egress (rule 2 excepts the cluster CIDRs) and re-permits only MySQL
(rule 3). When the requests-proxy architecture shipped — training pods
POST epoch results / FLOPs to requests-proxy-service:8888 instead of
holding Service Bus credentials — this template was never updated to
re-permit egress to the proxy. Result on every install with the policy
enabled: pods hit "requests-proxy-service:8888 ... [Errno 111]
Connection refused" at the first epoch finalize → CrashLoopBackOff →
all experiments fail.

Add rule 4 mirroring the MySQL rule: TCP/8888 to podSelector
app=requests-proxy (same namespace). Service selector + port from
templates/requests-proxy-service.yaml.

Verified: `helm template -f ci/bm-values.yaml --show-only
templates/network-policy-training.yaml` renders the new rule as valid
YAML.

Found live on a fresh client (tracebloc-amazon / k3d): jobs-manager
reached the proxy (HTTP 401) while training pods got connection-refused
— the only differentiator was this egress policy. Interim: live-patched
the cluster + suspended its auto-upgrade CronJob (so reuse-values
wouldn't revert the patch); re-enable once this lands + releases.

Closes #196.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Installer UX: drop PriorityClass, fix namespace, one-per-machine guard, surface version (#192)

* fix(chart): drop the data-plane PriorityClass by default

The cluster-scoped, fixed-name `tracebloc-data-plane` PriorityClass was the
only thing forcing one tracebloc client per cluster (a second release collided
on it with a cryptic Helm error) and blocking multiple tracebloc namespaces in
one BYO cluster. mysql doesn't need it: memory requests==limits (last evicted
under memory pressure), data on a PVC (eviction = transient restart, not data
loss), and a PDB guards voluntary disruptions. Its only unique benefit was
letting the scheduler preempt training jobs to keep mysql scheduled on a packed
node — a narrow case.

Default priorityClass.create=false + name="" so new installs template no
PriorityClass and mysql carries no priorityClassName. Opt back in
(create:true + name) on contended clusters, or reference an out-of-band one
(create:false + name:<existing>). helm-unittest updated; 144/144 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer): fixed namespace + one-per-machine guard; drop workspace prompt

The "Choose a workspace name" prompt asked the user to invent a label that
isn't their identity (the backend identifies a client by its credentials, not
this string — it's just the local k8s namespace / Helm release name; the
installer even discards the auth response body). It defaulted to a meaningless
"default" and was the field that collided on a second install.

- Drop the prompt; TB_NAMESPACE defaults to a fixed "tracebloc"
  (env-overridable for advanced/GitOps setups).
- One-client-per-machine guard: after credentials verify, compare the entered
  Client ID against any client already installed here (helm get values). Same
  ID = a normal re-run/upgrade; a DIFFERENT ID hard-blocks with an explanation
  and options (repair / switch via `k3d cluster delete` / use another machine)
  instead of silently re-pointing the machine. This replaces the accidental
  PriorityClass collision (now dropped) with an intentional, explained guard.
- Document the TB_NAMESPACE override; update bats (input sequences + 2 new
  guard tests). bats 26/27 — the 1 failure is a pre-existing macOS-bash-3.2
  quirk in _extract_yaml_value, unrelated (CI bash 5 passes it).

NOTE: install-k8s.ps1 + its Pester tests still need the same mirror (follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer-ps): mirror fixed namespace + one-per-machine guard (PowerShell)

Mirrors the bash change in install-k8s.ps1:
- drop the "Choose a workspace name" prompt; TB_NAMESPACE defaults to a fixed
  "tracebloc" (override via $env:TB_NAMESPACE).
- one-client-per-machine guard: after credentials verify, compare the entered
  Client ID against any client already installed here (helm get values); a
  different ID hard-blocks with the same explanation/options as bash.
- Pester: 2 new guard tests (block-different / allow-same). The existing
  Install-ClientHelm tests use dispatch-by-prompt Read-Host mocks, so the
  prompt removal doesn't disturb them.

No pwsh locally -> verified via CI (Pester ubuntu+windows + PSScriptAnalyzer).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): scan all namespaces in the one-per-machine guard

The guard checked only the `tracebloc` namespace, so a client installed by an
older installer version (default namespace `default`, or a custom name) wasn't
detected -- a re-run could create a second coexisting client. Now enumerate all
client-chart releases (helm list -A) and compare each one's clientId, covering
both fresh and migrated installs. bash uses jq (already a dependency; falls
back to the tracebloc namespace if absent); PowerShell uses ConvertFrom-Json.
The block message names the namespace. bats + Pester guard tests updated.

Verified: bats green locally (jq path); PowerShell via CI Pester.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer): show client (chart) version in summary + --diagnose

Users had no easy way to see which client version they're on (the CLI isn't
shipped yet; `helm list` needs the namespace, and nothing surfaced it). Show
the chart version where they already look:
- install summary: a "Version" line next to Workspace.
- --diagnose: as the first console line + recorded in the bundle header
  (the #1 thing support needs).

Adds a best-effort `_chart_version` / Get-ChartVersion helper (greps helm's
CHART column -> no jq). bash + PowerShell; bats + Pester coverage added.

Verified: summary.bats + diagnose.bats green locally; ps1 via CI Pester.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* chore: bump chart 1.4.4 → 1.4.5 to ship the training-egress proxy fix (#198)

1.4.4 is already published on the tracebloc.github.io/client Pages
channel and is what clusters run. The training-egress NetworkPolicy fix
(#197, allow training → requests-proxy:8888) merged to develop without a
version bump, so it is currently undeliverable: chart-releaser won't
overwrite the existing 1.4.4 release, and clusters already on 1.4.4 would
see no version change and pull nothing.

Bump to 1.4.5 (lockstep version/appVersion, matching 1.4.3/1.4.4 history)
so a v1.4.5 release publishes a new version that auto-upgrade actually
pulls. Chart-only change; no image change.

Ref #196 / #197.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: shujaat hasan <shujaathasan@shujaats-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: shujaat_tracebloc <153823837+shujaatTracebloc@users.noreply.github.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-coverage Automated test-coverage PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants