Skip to content

fix(taskruntime): fetch Workflow UID from apiserver (downward-API can't project it)#334

Merged
bdchatham merged 1 commit into
mainfrom
fix/workflow-uid-via-lookup
May 21, 2026
Merged

fix(taskruntime): fetch Workflow UID from apiserver (downward-API can't project it)#334
bdchatham merged 1 commit into
mainfrom
fix/workflow-uid-via-lookup

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

@bdchatham bdchatham commented May 21, 2026

Summary

First manual fire of the new release-test Workflow on harbor (post sei-protocol/platform#627 merge) surfaced two real bugs in the seitask + scenario YAML contract. Both are fixed here.

Bug 1 (the one we hit): seitask: infra: downward-API env not projected: [SEI_WORKFLOW_NAME SEI_WORKFLOW_UID SEI_NAMESPACE]

`keygen-admin` Task pod exited at startup. Root cause: `taskruntime.LoadWorkflowIdentity` expected three env vars projected via Kubernetes downward API, but the release-test scenario YAML's Task containers had no env block.

Worse: when we tried to add the downward-API projections, we discovered chaos-mesh stamps `chaos-mesh.org/workflow=` label on Task pods but does NOT stamp the Workflow CR's UID anywhere. The taskruntime/ownerref.go comment claiming chaos-mesh writes the UID to an annotation was aspirational; the reality on a v2.8.0 Task pod is no UID label, no UID annotation.

Bug 2 (downstream): `provision-validator-chain` stuck in `CreateContainerConfigError`

The Task pod referenced `workflow-vars-` ConfigMap via envFrom; that CM is created by keygen-admin. Because keygen-admin failed, the CM never existed and the kubelet couldn't start the container. Resolves automatically once keygen-admin succeeds.

Fix

`taskruntime.LoadWorkflowIdentity` signature change — now `(ctx, client.Client)` instead of `()`. Reads NAME + NAMESPACE from env (downward-API projectable), then fetches the Workflow CR's UID from the apiserver using NAME + NAMESPACE. `SEI_WORKFLOW_UID` env still works as a short-circuit (tests rely on it).

Subcommand callers (keygen, provision-snd, upload-report) reorder client construction before identity loading.

`scenarios/release-test.yaml` — every seitask Task container projects:
```yaml
env:

  • name: SEI_WORKFLOW_NAME
    valueFrom:
    fieldRef:
    fieldPath: metadata.labels['chaos-mesh.org/workflow']
  • name: SEI_NAMESPACE
    valueFrom:
    fieldRef:
    fieldPath: metadata.namespace
    ```

UID flows from the apiserver lookup in LoadWorkflowIdentity.

Runner subcommand unaffected — it doesn't call LoadWorkflowIdentity (verified). `major-upgrade.yaml` therefore doesn't need a scenario YAML update.

Test plan

  • `go test ./internal/taskruntime/...` passes including new apiserver-lookup test case (also covers workflow-not-found → Infra error, and env-UID short-circuit)
  • `go test ./...` passes
  • `golangci-lint run` clean
  • `scenarios/release-test.yaml` validates via `kubectl apply --dry-run=server --validate=strict` against the chaos-mesh.org Workflow CRD on harbor
  • After merge + image build + SCENARIO_REF bump in sei-protocol/platform's workflow.yaml: manually fire `release-test-workflow` CronJob and verify keygen-admin, provision-validator-chain, provision-rpc-fleet, run-release-test, upload-report all complete

🤖 Generated with Claude Code

…'t project it)

First end-to-end manual fire of the release-test Workflow surfaced two
real bugs:

1. **keygen-admin Task pod failed at startup with**
   `seitask: infra: downward-API env not projected: [SEI_WORKFLOW_NAME
   SEI_WORKFLOW_UID SEI_NAMESPACE]`.
   Root cause: chaos-mesh stamps `chaos-mesh.org/workflow=<name>` label
   on Task pods but does NOT stamp the workflow UID anywhere. The
   taskruntime/ownerref.go comment claiming chaos-mesh writes UID to an
   annotation was aspirational, not observed.
2. **provision-validator-chain Task pod stuck in CreateContainerConfigError**
   waiting for the workflow-vars ConfigMap that keygen-admin would have
   created had it not failed. Downstream effect of (1).

Fix in two parts:

**Go (this PR):**
- `taskruntime.LoadWorkflowIdentity` now takes `(ctx, client.Client)`.
  Reads NAME + NAMESPACE from env (projected via downward API by the
  scenario YAML), then fetches the Workflow CR's UID from the apiserver
  using NAME + NAMESPACE. SEI_WORKFLOW_UID env still works as a
  short-circuit override for tests.
- Subcommand callers (keygen, provision-snd, upload-report) reorder
  client construction before identity loading so the identity lookup
  has a client to work with.
- New test cases: apiserver-lookup path (UID from CR), workflow-not-
  found path (Infra error), env-short-circuit (UID env wins).

**Scenario YAML (this PR):**
- `scenarios/release-test.yaml`'s four seitask Task containers now
  project SEI_WORKFLOW_NAME via fieldRef on
  `metadata.labels['chaos-mesh.org/workflow']` and SEI_NAMESPACE via
  `metadata.namespace`. UID flows from the apiserver lookup.

The runner subcommand is unaffected — it doesn't call
LoadWorkflowIdentity (verified). major-upgrade.yaml's runner Tasks
therefore don't need a downward-API update.

Validated against the actual first-fire failure mode: kubectl apply
--dry-run=server passes; tests cover both paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 21, 2026

PR Summary

Medium Risk
Changes how seitask discovers its parent Workflow identity by introducing an apiserver lookup and updating scenario env projections; failures here would break task startup/ownerRef stamping across workflows.

Overview
Fixes seitask startup/cleanup wiring by changing taskruntime.LoadWorkflowIdentity to take (ctx, client) and fetch the parent Workflow UID from the Kubernetes apiserver when SEI_WORKFLOW_UID is not provided (still supported as a short-circuit).

Updates keygen, provision-snd, and upload-report to build the kube client before loading workflow identity, adjusts tests accordingly, and updates scenarios/release-test.yaml so each seitask Task container projects SEI_WORKFLOW_NAME and SEI_NAMESPACE via downward API.

Reviewed by Cursor Bugbot for commit a605b8e. Bugbot is set up for automated code reviews on this repo. Configure here.

@bdchatham bdchatham merged commit 6ecf5b9 into main May 21, 2026
5 checks passed
bdchatham added a commit that referenced this pull request May 21, 2026
… convention (#337)

Second bug from the post-#334 manual fire. keygen-admin creates the
workflow-vars CM via taskruntime.EnsureWorkflowVarsCM which derives the
CM name from the FULL Workflow CR name (release-test-${RUN_ID}). The
scenario YAML's three envFrom references used just ${SEI_WORKFLOW_RUN_ID}
(no scenario prefix), so the kubelet looked for a CM that didn't exist
and provision-validator-chain's pod stuck in CreateContainerConfigError
indefinitely.

Fix: align all three envFrom references to
workflow-vars-release-test-${SEI_WORKFLOW_RUN_ID}, matching what keygen
actually creates (and the admin-secret reference at line 155 already
uses the same workflow-name-based convention).

Convention: workflow-vars-<workflow-cr-name>. Sourced from
taskruntime.WorkflowVarsName which has no scenario knowledge and only
takes the Workflow CR name.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request May 21, 2026
Cross-review feedback from platform-engineer + kubernetes-specialist on
the #339 chain of fixes: both reviewers independently noted that we've
been discovering contract drift between the seitask binary internals
and the scenario YAML / RBAC layer at first-fire instead of at build
time. Each first-fire bug (#334, #337, #339) is the same shape: an
internal helper has a convention, the scenario author has to mirror it
manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:

- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have
  caught the #339 scheme-registration bug at `go test`. Validates that
  the package-level taskScheme actually has every sei.io/v1alpha1 type
  the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught
  the #337 CM-name drift at `go test`. Walks every scenario YAML in
  the opt-in allow list (release-test.yaml today), extracts the
  Workflow CR's metadata.name, asserts every envFrom configMapRef.name
  matches WorkflowVarsName(metadataName). Major-upgrade is excluded —
  its CM is bash-created with a different convention; revisit when the
  half-bash legacy retires.

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist
  ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340
  (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env
  approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request May 21, 2026
* fix(seitask): register sei.io scheme + grant workflownodes RBAC

Third manual fire of release-test surfaced two more contract bugs:

1. provision-validator-chain + provision-rpc-fleet failed at Create
   time with `no kind is registered for the type v1alpha1.SeiNodeDeployment
   in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a
   controller-runtime client with client-go's built-in scheme (K8s
   types only); sei.io/v1alpha1 was never registered. Fix: local
   taskScheme registering builtin + sei.io/v1alpha1.

2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml
   granting workflows (get) but not workflownodes. Fix: add workflownodes
   (get, list).

Run-release-test's "Base URL is missing a protocol" was a downstream
symptom — provision-snd never published endpoints to workflow-vars, so
$(RPC_TM_RPC) couldn't resolve and the literal string passed through to
the release-test image. Resolves automatically once #1 is fixed.

Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors —
each WorkflowNode transitions to Accomplished=True on pod termination
regardless of exit code, and Serial proceeds to the next child. Filed
as separate follow-up; tracked as architectural concern, not in scope
for this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(seitask): build-time guards for scheme + scenario CM-name contracts

Cross-review feedback from platform-engineer + kubernetes-specialist on
the #339 chain of fixes: both reviewers independently noted that we've
been discovering contract drift between the seitask binary internals
and the scenario YAML / RBAC layer at first-fire instead of at build
time. Each first-fire bug (#334, #337, #339) is the same shape: an
internal helper has a convention, the scenario author has to mirror it
manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:

- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have
  caught the #339 scheme-registration bug at `go test`. Validates that
  the package-level taskScheme actually has every sei.io/v1alpha1 type
  the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught
  the #337 CM-name drift at `go test`. Walks every scenario YAML in
  the opt-in allow list (release-test.yaml today), extracts the
  Workflow CR's metadata.name, asserts every envFrom configMapRef.name
  matches WorkflowVarsName(metadataName). Major-upgrade is excluded —
  its CM is bash-created with a different convention; revisit when the
  half-bash legacy retires.

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist
  ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340
  (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env
  approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant