fix(taskruntime): fetch Workflow UID from apiserver (downward-API can't project it)#334
Conversation
…'t project it) First end-to-end manual fire of the release-test Workflow surfaced two real bugs: 1. **keygen-admin Task pod failed at startup with** `seitask: infra: downward-API env not projected: [SEI_WORKFLOW_NAME SEI_WORKFLOW_UID SEI_NAMESPACE]`. Root cause: chaos-mesh stamps `chaos-mesh.org/workflow=<name>` label on Task pods but does NOT stamp the workflow UID anywhere. The taskruntime/ownerref.go comment claiming chaos-mesh writes UID to an annotation was aspirational, not observed. 2. **provision-validator-chain Task pod stuck in CreateContainerConfigError** waiting for the workflow-vars ConfigMap that keygen-admin would have created had it not failed. Downstream effect of (1). Fix in two parts: **Go (this PR):** - `taskruntime.LoadWorkflowIdentity` now takes `(ctx, client.Client)`. Reads NAME + NAMESPACE from env (projected via downward API by the scenario YAML), then fetches the Workflow CR's UID from the apiserver using NAME + NAMESPACE. SEI_WORKFLOW_UID env still works as a short-circuit override for tests. - Subcommand callers (keygen, provision-snd, upload-report) reorder client construction before identity loading so the identity lookup has a client to work with. - New test cases: apiserver-lookup path (UID from CR), workflow-not- found path (Infra error), env-short-circuit (UID env wins). **Scenario YAML (this PR):** - `scenarios/release-test.yaml`'s four seitask Task containers now project SEI_WORKFLOW_NAME via fieldRef on `metadata.labels['chaos-mesh.org/workflow']` and SEI_NAMESPACE via `metadata.namespace`. UID flows from the apiserver lookup. The runner subcommand is unaffected — it doesn't call LoadWorkflowIdentity (verified). major-upgrade.yaml's runner Tasks therefore don't need a downward-API update. Validated against the actual first-fire failure mode: kubectl apply --dry-run=server passes; tests cover both paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR SummaryMedium Risk Overview Updates Reviewed by Cursor Bugbot for commit a605b8e. Bugbot is set up for automated code reviews on this repo. Configure here. |
… convention (#337) Second bug from the post-#334 manual fire. keygen-admin creates the workflow-vars CM via taskruntime.EnsureWorkflowVarsCM which derives the CM name from the FULL Workflow CR name (release-test-${RUN_ID}). The scenario YAML's three envFrom references used just ${SEI_WORKFLOW_RUN_ID} (no scenario prefix), so the kubelet looked for a CM that didn't exist and provision-validator-chain's pod stuck in CreateContainerConfigError indefinitely. Fix: align all three envFrom references to workflow-vars-release-test-${SEI_WORKFLOW_RUN_ID}, matching what keygen actually creates (and the admin-secret reference at line 155 already uses the same workflow-name-based convention). Convention: workflow-vars-<workflow-cr-name>. Sourced from taskruntime.WorkflowVarsName which has no scenario knowledge and only takes the Workflow CR name. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift. Two narrow guards land here, ranked by ROI: - TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get. - TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded — its CM is bash-created with a different convention; revisit when the half-bash legacy retires. Defers (filed/tracked separately, not in scope for this PR): - RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence). - Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340). - EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340). - Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(seitask): register sei.io scheme + grant workflownodes RBAC Third manual fire of release-test surfaced two more contract bugs: 1. provision-validator-chain + provision-rpc-fleet failed at Create time with `no kind is registered for the type v1alpha1.SeiNodeDeployment in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a controller-runtime client with client-go's built-in scheme (K8s types only); sei.io/v1alpha1 was never registered. Fix: local taskScheme registering builtin + sei.io/v1alpha1. 2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml granting workflows (get) but not workflownodes. Fix: add workflownodes (get, list). Run-release-test's "Base URL is missing a protocol" was a downstream symptom — provision-snd never published endpoints to workflow-vars, so $(RPC_TM_RPC) couldn't resolve and the literal string passed through to the release-test image. Resolves automatically once #1 is fixed. Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors — each WorkflowNode transitions to Accomplished=True on pod termination regardless of exit code, and Serial proceeds to the next child. Filed as separate follow-up; tracked as architectural concern, not in scope for this fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(seitask): build-time guards for scheme + scenario CM-name contracts Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift. Two narrow guards land here, ranked by ROI: - TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get. - TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded — its CM is bash-created with a different convention; revisit when the half-bash legacy retires. Defers (filed/tracked separately, not in scope for this PR): - RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence). - Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340). - EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340). - Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
First manual fire of the new release-test Workflow on harbor (post sei-protocol/platform#627 merge) surfaced two real bugs in the seitask + scenario YAML contract. Both are fixed here.
Bug 1 (the one we hit):
seitask: infra: downward-API env not projected: [SEI_WORKFLOW_NAME SEI_WORKFLOW_UID SEI_NAMESPACE]`keygen-admin` Task pod exited at startup. Root cause: `taskruntime.LoadWorkflowIdentity` expected three env vars projected via Kubernetes downward API, but the release-test scenario YAML's Task containers had no env block.
Worse: when we tried to add the downward-API projections, we discovered chaos-mesh stamps `chaos-mesh.org/workflow=` label on Task pods but does NOT stamp the Workflow CR's UID anywhere. The taskruntime/ownerref.go comment claiming chaos-mesh writes the UID to an annotation was aspirational; the reality on a v2.8.0 Task pod is no UID label, no UID annotation.
Bug 2 (downstream): `provision-validator-chain` stuck in `CreateContainerConfigError`
The Task pod referenced `workflow-vars-` ConfigMap via envFrom; that CM is created by keygen-admin. Because keygen-admin failed, the CM never existed and the kubelet couldn't start the container. Resolves automatically once keygen-admin succeeds.
Fix
`taskruntime.LoadWorkflowIdentity` signature change — now `(ctx, client.Client)` instead of `()`. Reads NAME + NAMESPACE from env (downward-API projectable), then fetches the Workflow CR's UID from the apiserver using NAME + NAMESPACE. `SEI_WORKFLOW_UID` env still works as a short-circuit (tests rely on it).
Subcommand callers (keygen, provision-snd, upload-report) reorder client construction before identity loading.
`scenarios/release-test.yaml` — every seitask Task container projects:
```yaml
env:
valueFrom:
fieldRef:
fieldPath: metadata.labels['chaos-mesh.org/workflow']
valueFrom:
fieldRef:
fieldPath: metadata.namespace
```
UID flows from the apiserver lookup in LoadWorkflowIdentity.
Runner subcommand unaffected — it doesn't call LoadWorkflowIdentity (verified). `major-upgrade.yaml` therefore doesn't need a scenario YAML update.
Test plan
🤖 Generated with Claude Code