fix(taskruntime): fetch Workflow UID from apiserver (downward-API can't project it) by bdchatham · Pull Request #334 · sei-protocol/sei-k8s-controller

bdchatham · 2026-05-21T17:15:42Z

Summary

First manual fire of the new release-test Workflow on harbor (post sei-protocol/platform#627 merge) surfaced two real bugs in the seitask + scenario YAML contract. Both are fixed here.

Bug 1 (the one we hit): `seitask: infra: downward-API env not projected: [SEI_WORKFLOW_NAME SEI_WORKFLOW_UID SEI_NAMESPACE]`

`keygen-admin` Task pod exited at startup. Root cause: `taskruntime.LoadWorkflowIdentity` expected three env vars projected via Kubernetes downward API, but the release-test scenario YAML's Task containers had no env block.

Worse: when we tried to add the downward-API projections, we discovered chaos-mesh stamps `chaos-mesh.org/workflow=` label on Task pods but does NOT stamp the Workflow CR's UID anywhere. The taskruntime/ownerref.go comment claiming chaos-mesh writes the UID to an annotation was aspirational; the reality on a v2.8.0 Task pod is no UID label, no UID annotation.

Bug 2 (downstream): `provision-validator-chain` stuck in `CreateContainerConfigError`

The Task pod referenced `workflow-vars-` ConfigMap via envFrom; that CM is created by keygen-admin. Because keygen-admin failed, the CM never existed and the kubelet couldn't start the container. Resolves automatically once keygen-admin succeeds.

Fix

`taskruntime.LoadWorkflowIdentity` signature change — now `(ctx, client.Client)` instead of `()`. Reads NAME + NAMESPACE from env (downward-API projectable), then fetches the Workflow CR's UID from the apiserver using NAME + NAMESPACE. `SEI_WORKFLOW_UID` env still works as a short-circuit (tests rely on it).

Subcommand callers (keygen, provision-snd, upload-report) reorder client construction before identity loading.

`scenarios/release-test.yaml` — every seitask Task container projects:
```yaml
env:

name: SEI_WORKFLOW_NAME
valueFrom:
fieldRef:
fieldPath: metadata.labels['chaos-mesh.org/workflow']
name: SEI_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
```

UID flows from the apiserver lookup in LoadWorkflowIdentity.

Runner subcommand unaffected — it doesn't call LoadWorkflowIdentity (verified). `major-upgrade.yaml` therefore doesn't need a scenario YAML update.

Test plan

`go test ./internal/taskruntime/...` passes including new apiserver-lookup test case (also covers workflow-not-found → Infra error, and env-UID short-circuit)
`go test ./...` passes
`golangci-lint run` clean
`scenarios/release-test.yaml` validates via `kubectl apply --dry-run=server --validate=strict` against the chaos-mesh.org Workflow CRD on harbor
After merge + image build + SCENARIO_REF bump in sei-protocol/platform's workflow.yaml: manually fire `release-test-workflow` CronJob and verify keygen-admin, provision-validator-chain, provision-rpc-fleet, run-release-test, upload-report all complete

🤖 Generated with Claude Code

…'t project it) First end-to-end manual fire of the release-test Workflow surfaced two real bugs: 1. **keygen-admin Task pod failed at startup with** `seitask: infra: downward-API env not projected: [SEI_WORKFLOW_NAME SEI_WORKFLOW_UID SEI_NAMESPACE]`. Root cause: chaos-mesh stamps `chaos-mesh.org/workflow=<name>` label on Task pods but does NOT stamp the workflow UID anywhere. The taskruntime/ownerref.go comment claiming chaos-mesh writes UID to an annotation was aspirational, not observed. 2. **provision-validator-chain Task pod stuck in CreateContainerConfigError** waiting for the workflow-vars ConfigMap that keygen-admin would have created had it not failed. Downstream effect of (1). Fix in two parts: **Go (this PR):** - `taskruntime.LoadWorkflowIdentity` now takes `(ctx, client.Client)`. Reads NAME + NAMESPACE from env (projected via downward API by the scenario YAML), then fetches the Workflow CR's UID from the apiserver using NAME + NAMESPACE. SEI_WORKFLOW_UID env still works as a short-circuit override for tests. - Subcommand callers (keygen, provision-snd, upload-report) reorder client construction before identity loading so the identity lookup has a client to work with. - New test cases: apiserver-lookup path (UID from CR), workflow-not- found path (Infra error), env-short-circuit (UID env wins). **Scenario YAML (this PR):** - `scenarios/release-test.yaml`'s four seitask Task containers now project SEI_WORKFLOW_NAME via fieldRef on `metadata.labels['chaos-mesh.org/workflow']` and SEI_NAMESPACE via `metadata.namespace`. UID flows from the apiserver lookup. The runner subcommand is unaffected — it doesn't call LoadWorkflowIdentity (verified). major-upgrade.yaml's runner Tasks therefore don't need a downward-API update. Validated against the actual first-fire failure mode: kubectl apply --dry-run=server passes; tests cover both paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor · 2026-05-21T17:15:49Z

PR Summary

Medium Risk
Changes how seitask discovers its parent Workflow identity by introducing an apiserver lookup and updating scenario env projections; failures here would break task startup/ownerRef stamping across workflows.

Overview
Fixes seitask startup/cleanup wiring by changing taskruntime.LoadWorkflowIdentity to take (ctx, client) and fetch the parent Workflow UID from the Kubernetes apiserver when SEI_WORKFLOW_UID is not provided (still supported as a short-circuit).

Updates keygen, provision-snd, and upload-report to build the kube client before loading workflow identity, adjusts tests accordingly, and updates scenarios/release-test.yaml so each seitask Task container projects SEI_WORKFLOW_NAME and SEI_NAMESPACE via downward API.

^{Reviewed by Cursor Bugbot for commit a605b8e. Bugbot is set up for automated code reviews on this repo. Configure here.}

… convention (#337) Second bug from the post-#334 manual fire. keygen-admin creates the workflow-vars CM via taskruntime.EnsureWorkflowVarsCM which derives the CM name from the FULL Workflow CR name (release-test-${RUN_ID}). The scenario YAML's three envFrom references used just ${SEI_WORKFLOW_RUN_ID} (no scenario prefix), so the kubelet looked for a CM that didn't exist and provision-validator-chain's pod stuck in CreateContainerConfigError indefinitely. Fix: align all three envFrom references to workflow-vars-release-test-${SEI_WORKFLOW_RUN_ID}, matching what keygen actually creates (and the admin-secret reference at line 155 already uses the same workflow-name-based convention). Convention: workflow-vars-<workflow-cr-name>. Sourced from taskruntime.WorkflowVarsName which has no scenario knowledge and only takes the Workflow CR name. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift. Two narrow guards land here, ranked by ROI: - TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get. - TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded — its CM is bash-created with a different convention; revisit when the half-bash legacy retires. Defers (filed/tracked separately, not in scope for this PR): - RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence). - Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340). - EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340). - Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(seitask): register sei.io scheme + grant workflownodes RBAC Third manual fire of release-test surfaced two more contract bugs: 1. provision-validator-chain + provision-rpc-fleet failed at Create time with `no kind is registered for the type v1alpha1.SeiNodeDeployment in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a controller-runtime client with client-go's built-in scheme (K8s types only); sei.io/v1alpha1 was never registered. Fix: local taskScheme registering builtin + sei.io/v1alpha1. 2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml granting workflows (get) but not workflownodes. Fix: add workflownodes (get, list). Run-release-test's "Base URL is missing a protocol" was a downstream symptom — provision-snd never published endpoints to workflow-vars, so $(RPC_TM_RPC) couldn't resolve and the literal string passed through to the release-test image. Resolves automatically once #1 is fixed. Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors — each WorkflowNode transitions to Accomplished=True on pod termination regardless of exit code, and Serial proceeds to the next child. Filed as separate follow-up; tracked as architectural concern, not in scope for this fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(seitask): build-time guards for scheme + scenario CM-name contracts Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift. Two narrow guards land here, ranked by ROI: - TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get. - TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded — its CM is bash-created with a different convention; revisit when the half-bash legacy retires. Defers (filed/tracked separately, not in scope for this PR): - RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence). - Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340). - EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340). - Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bdchatham merged commit 6ecf5b9 into main May 21, 2026
5 checks passed

bdchatham mentioned this pull request May 21, 2026

fix(scenarios/release-test): align workflow-vars CM name with seitask convention #337

Merged

2 tasks

bdchatham mentioned this pull request May 21, 2026

Scenario contract enforcement: build-time guards + single-sourced CM name #341

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(taskruntime): fetch Workflow UID from apiserver (downward-API can't project it)#334

fix(taskruntime): fetch Workflow UID from apiserver (downward-API can't project it)#334
bdchatham merged 1 commit into
mainfrom
fix/workflow-uid-via-lookup

bdchatham commented May 21, 2026 •

edited

Loading

Uh oh!

cursor Bot commented May 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bdchatham commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Bug 1 (the one we hit): seitask: infra: downward-API env not projected: [SEI_WORKFLOW_NAME SEI_WORKFLOW_UID SEI_NAMESPACE]

Bug 2 (downstream): `provision-validator-chain` stuck in `CreateContainerConfigError`

Fix

Test plan

Uh oh!

cursor Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bdchatham commented May 21, 2026 •

edited

Loading

Bug 1 (the one we hit): `seitask: infra: downward-API env not projected: [SEI_WORKFLOW_NAME SEI_WORKFLOW_UID SEI_NAMESPACE]`

cursor Bot commented May 21, 2026 •

edited

Loading