
To Prod #503

Merged
eskp merged 54 commits into prod from staging
Mar 5, 2026

eskp (Member) commented Mar 5, 2026

No description provided.

eskp and others added 30 commits March 4, 2026 14:15
The Workflow DevKit runtime has no abort mechanism, so after
the cancel endpoint sets status to "cancelled" the runtime
continues executing steps and writing "success" logs to the
DB. This causes steps to flash green and node borders to
stick on their last color.

Add three layers of defense:

- Server-side guards in workflow-logging.ts: logStepStartDb,
  logStepCompleteDb, updateCurrentStep, and
  incrementCompletedSteps all bail out when the execution is
  in a terminal state (cancelled/success/error)

- Cancel endpoint cleanup: mark any in-flight "running" step
  logs as "error" and protect the internal PATCH route from
  overwriting a cancelled execution

- Client-side fixes: Runs panel does one final log refresh
  when an execution transitions to terminal then stops
  polling it; toolbar resets nodes to idle on cancel; new
  runsRefreshTriggerAtom gives instant Run row appearance
  after clicking Run instead of waiting for the 2s poll
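
The server-side guard from the first layer can be sketched roughly as follows. This is a minimal illustration only: it assumes an in-memory map in place of the database, and the `Execution` shape, the `executions` store, and the function body are hypothetical rather than the code in workflow-logging.ts.

```typescript
// Hypothetical status union; the commit names cancelled/success/error as terminal.
type ExecutionStatus = "pending" | "running" | "success" | "error" | "cancelled";

const TERMINAL_STATUSES = new Set<ExecutionStatus>(["cancelled", "success", "error"]);

function isTerminal(status: ExecutionStatus): boolean {
  return TERMINAL_STATUSES.has(status);
}

// In-memory stand-in for the executions table (assumption for this sketch).
interface Execution {
  status: ExecutionStatus;
  completedSteps: number;
}

const executions = new Map<string, Execution>();

// Returns true when the write was applied, false when it was skipped
// because the execution is unknown or already in a terminal state.
function incrementCompletedSteps(executionId: string): boolean {
  const execution = executions.get(executionId);
  if (!execution || isTerminal(execution.status)) {
    return false;
  }
  execution.completedSteps += 1;
  return true;
}
```

With a guard of this shape, a runtime that keeps executing after a cancel can still call the logging helpers, but its late writes no longer change the stored state.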
- Rename e2e-tests.yml to e2e-tests-local.yml with explicit label gating
- Remove workflow_run trigger that caused phantom CI runs on every PR push
- Add e2e-vitest-remote and e2e-playwright-remote to deploy-pr-environment.yaml
  gated by run-e2e-tests-pr-deploy label
- Add e2e-playwright-remote to deploy-keeperhub.yaml for post-deploy verification
- Vitest remote on PR envs uses kubectl port-forward to CloudNativePG and LocalStack
- No vitest-remote on staging/prod (direct DB writes risk corrupting live data)
- Add docs/testing/README.md with workflow architecture and design decisions

Check github.event.label.name to distinguish which label was just added.
Adding run-e2e-tests-pr-deploy no longer re-deploys the PR environment.
Adding unrelated labels no longer re-runs local e2e tests.

On synchronize events (new commits), jobs re-run if their labels are present.
Add TEST_API_KEY-gated admin endpoints so Playwright tests can retrieve
OTP codes and invitation IDs from deployed environments without direct
DB access. This removes the --grep-invert exclusions for invitation and
wallet tests on remote runs.

- keeperhub/lib/admin-auth.ts: timing-safe Bearer auth + @techops.services validation
- /api/admin/test/otp: OTP lookup via Drizzle
- /api/admin/test/invitation: invitation ID lookup via Drizzle
- Test utils auto-switch between API (remote) and direct DB (ephemeral)
- Preflight checks in global-setup.ts validate env vars before tests run
- Rename e2e-tests-local.yml -> e2e-tests-ephemeral.yml (jobs: ephemeral)
- Add TEST_API_KEY to deploy values (SSM) and workflow env

deploy-keeperhub.yaml referenced "E2E Tests Local" but the workflow
was renamed to "E2E Tests Ephemeral". workflow_run triggers match by
name, so the deploy pipeline would never trigger after ephemeral
tests pass.
Mermaid diagrams, workflow files table, and label reference still
used the old "local" naming. Updated to match the actual workflow
file and job names (ephemeral).
…module

The same function was duplicated in auth.ts and invitations.ts.
Extracted to admin-fetch.ts and imported from both.
The early return on length mismatch allowed an attacker to infer key
length via timing. Hash both values with SHA-256 first so
timingSafeEqual always compares fixed-length inputs.
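
A minimal sketch of the hash-then-compare approach described above, using Node's built-in crypto module. The helper names `sha256` and `keysMatch` are hypothetical and are not taken from keeperhub/lib/admin-auth.ts.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// SHA-256 always yields a 32-byte digest, regardless of input length.
function sha256(value: string): Buffer {
  return createHash("sha256").update(value, "utf8").digest();
}

// Compare two secrets without an early return on length mismatch.
// timingSafeEqual requires equal-length inputs, which hashing guarantees.
function keysMatch(provided: string, expected: string): boolean {
  return timingSafeEqual(sha256(provided), sha256(expected));
}
```

Because both digests have the same length, the comparison takes the same path for keys of any length, so a caller cannot learn the expected key's length from a fast rejection.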
The @techops.services email check is an authorization constraint, not
a validation error. 403 Forbidden is semantically correct.
…onment

The new remote test jobs used actions/checkout@v4 while the rest of
the file uses v6. Standardized to v6.
- Unit tests for authenticateAdmin and validateTestEmail (9 cases)
- Integration tests for GET /api/admin/test/otp (10 cases)
- Integration tests for GET /api/admin/test/invitation (7 cases)

Covers auth rejection, email domain restriction, DB success/empty/error
paths, and OTP value parsing.
Allows the e2e-playwright-remote job to access environment-scoped
secrets (TEST_API_KEY) needed for admin test API authentication.
…orward

Add AWS/kubectl/DB port-forward steps to the e2e-playwright-remote job
so persistent test users are seeded before tests run. Update
global-setup to run seed in remote mode when DATABASE_URL is available.
Replace workflow_run triggers with workflow_call reusable workflows
coordinated by orchestrator files. This associates workflow runs with
their source branch in the GitHub Actions UI.

- Add ci-pipeline.yml orchestrator (e2e-tests -> deploy)
- Add release-pipeline.yml orchestrator (release -> docs-sync)
- Convert e2e-tests-ephemeral, deploy-keeperhub, release, docs-sync
  to reusable workflows (workflow_call)
- Pass caller_event input to work around github.event_name always
  being 'workflow_call' in reusable workflows
- Simplify branch/SHA expressions by removing workflow_run fallbacks
- Add type headers (orchestrator/reusable) to all pipeline workflows
signOut() previously did nothing if the user menu wasn't visible,
allowing tests to silently proceed while still logged in. Now uses
expect assertions that throw on timeout, failing the test immediately.
Linear 500ms retries flake under load. Exponential backoff
(500ms -> 1s -> 2s -> 4s cap) with 8 retries gives the server
time to process OTP generation asynchronously.
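
The retry schedule above can be sketched as follows. This is an illustration under stated assumptions: `pollWithBackoff` and its `lookup`/`sleep` parameters are hypothetical names, and "8 retries" is read here as at most 8 lookups in total.

```typescript
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 4000;
const MAX_ATTEMPTS = 8;

// Delay before the next attempt: 500ms, 1s, 2s, then capped at 4s.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Calls `lookup` until it returns a non-null value or attempts run out.
// `sleep` is injectable so tests can run without real timers.
async function pollWithBackoff<T>(
  lookup: () => Promise<T | null>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const result = await lookup();
    if (result !== null) {
      return result;
    }
    // No point waiting after the final attempt.
    if (attempt < MAX_ATTEMPTS - 1) {
      await sleep(backoffDelayMs(attempt));
    }
  }
  throw new Error(`No result after ${MAX_ATTEMPTS} attempts`);
}
```

Under these assumptions the worst case waits 500 + 1000 + 2000 + 4 x 4000 = 19500 ms across seven sleeps, compared with 3500 ms for seven linear 500ms sleeps.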
Same fix as OTP polling: exponential backoff (500ms -> 4s cap)
instead of linear 500ms retries.
Stop forwarding error.message to the client in 500 responses.
The detail is still logged server-side via console.error.
Admin routes now return "Internal server error" instead of
forwarding error.message. Update test expectations to match.
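
A rough sketch of the error-handling pattern from the two commits above: log the detail on the server and return a fixed message to the client. The `runAdminHandler` wrapper and `JsonResponse` shape are hypothetical and framework-agnostic; the actual routes may be structured differently.

```typescript
// Plain response shape used for illustration (not a framework type).
interface JsonResponse {
  status: number;
  body: { error?: string; data?: unknown };
}

// Runs a handler and converts any thrown error into a generic 500.
// The original error goes to the server log only, never to the client.
function runAdminHandler(
  handler: () => unknown,
  log: (message: string, error: unknown) => void = console.error,
): JsonResponse {
  try {
    return { status: 200, body: { data: handler() } };
  } catch (error) {
    log("[admin] request failed:", error);
    return { status: 500, body: { error: "Internal server error" } };
  }
}
```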
deploy-keeperhub.yaml is now a reusable workflow called via
workflow_call from ci-pipeline.yml, not triggered by workflow_run.
…ction

Deduplicate ~40 lines of setup steps repeated across 6 jobs.
New .github/actions/setup-node-pnpm/action.yml with two boolean
inputs: install-playwright and discover-plugins.
- Rename "Local" context to "Ephemeral" to match naming convention
- Fix mermaid diagram: replace non-existent check-e2e-label with check-labels
- Update e2e-tests-ephemeral.yml trigger to reflect workflow_call pattern
- Correct remote vitest test count from 114+ to ~130
- Add composite action (setup-node-pnpm) documentation to workflow docs
…ignals

Add data attributes to app components for test automation:
- workflow-canvas: data-ready for canvas load state
- org-switcher: data-state for switching/loading/ready
- accept-invite: data-page-state for hydration state

Replace waitForTimeout/networkidle with element assertions:
- auth.setup: remove 2s sleeps, use domcontentloaded + org-switcher wait
- auth utils: replace networkidle with org-switcher visibility
- workflow utils: use data-ready, waitForURL, element assertions
- invitations: replace retry loop with data-page-state wait
- workflow.test/schedule-trigger.test: remove all waitForTimeout calls
- organization-wallet.test: replace toast race with element assertion

Stabilize playwright config:
- fullyParallel: false, workers: 1 (serial to avoid shared-state conflicts)
- retries: 2 (handles environmental flakiness)
- reporter: github + html in CI, list locally

- Unskip analytics-gas, scheduled-workflow, web3-balance, para-wallet tests
- Remove stop-execution.test.ts and ORG-4 placeholder (unimplemented UI)
- Fix Para Wallet "Create Wallet" ambiguous selector (data-slot scoping)
- Fix analytics-gas scrollIntoView race (wait for attachment first)
- Remove CI sharding from e2e-tests-ephemeral.yml (serial execution)
- Move regex literal to top-level scope (biome lint fix)
- Use consistent button[role="combobox"] selector in signIn()
- Update docs with CI execution model and stability decisions
suisuss and others added 15 commits March 5, 2026 19:26
The deployed PR environment runs NODE_ENV=development, not test/CI,
so rate limiting was still active. Add DISABLE_AUTH_RATE_LIMIT env var
to PR environment values and check it in the auth config.
Use better-auth customRules to bypass rate limiting when requests
include a valid X-Test-API-Key header. Playwright config sends this
header automatically when TEST_API_KEY env var is set. This keeps
rate limiting active for real users while allowing E2E tests to run
without hitting limits in PR, staging, and prod environments.
Add testFetch() and getTestHeaders() to vitest E2E utils for future
tests that hit auth endpoints. Also add X-Test-API-Key to Playwright
admin-fetch headers.
Fix customRules return type (return currentRule instead of undefined),
remove unused biome-ignore suppression, drop unnecessary async.
Remote and ephemeral E2E test jobs are disabled (if: false) across
deploy-keeperhub, deploy-pr-environment, and e2e-tests-ephemeral
workflows while auth/rate-limit infrastructure is being stabilised.
Remote tests gated by ENABLE_E2E_REMOTE_TESTS, ephemeral tests by
ENABLE_E2E_EPHEMERAL_TESTS. Both are GitHub repository variables.
Currently ephemeral=true, remote=false.
Condition-based branching workflows (e.g. parallel "Balance < 1 ETH" and
"Balance >= 1 ETH") incorrectly show "Error" status when one branch is
dead. Root cause: finalSuccess treats every result entry equally, so a
condition that fails because it references an unexecuted dead-branch
node poisons the entire run.

Three fixes:
- Track condition routing decisions (conditionDecisions map) and exclude
  nodes on not-taken branches from the finalSuccess calculation.
- Harden replaceTemplateVariable: when a referenced node exists in the
  graph but was never executed (dead branch), return undefined instead
  of throwing, so the condition evaluates gracefully to false.
- Add diagnostic logging when finalSuccess is false in a branching
  workflow to aid production debugging.
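
The first fix can be illustrated with a simplified model. This sketch assumes a graph of edges where edges leaving a condition node carry a `branch` label, and a `conditionDecisions` map from condition node ID to the boolean outcome; the `Edge` and `NodeResult` shapes and both function names are hypothetical, not the executor's real types.

```typescript
interface Edge {
  source: string;
  target: string;
  // Present only on edges that leave a condition node (assumption).
  branch?: "true" | "false";
}

interface NodeResult {
  success: boolean;
}

// Walks the graph from its root nodes and skips edges on not-taken branches.
// Any node that is never reached sits on a dead branch.
function findLiveNodes(
  nodeIds: string[],
  edges: Edge[],
  conditionDecisions: Map<string, boolean>,
): Set<string> {
  const targets = new Set<string>(edges.map((edge) => edge.target));
  const queue = nodeIds.filter((id) => !targets.has(id));
  const live = new Set<string>(queue);
  while (queue.length > 0) {
    const current = queue.shift() as string;
    const decision = conditionDecisions.get(current);
    for (const edge of edges) {
      if (edge.source !== current || live.has(edge.target)) continue;
      const notTaken =
        edge.branch !== undefined &&
        decision !== undefined &&
        edge.branch !== String(decision);
      if (notTaken) continue;
      live.add(edge.target);
      queue.push(edge.target);
    }
  }
  return live;
}

// Only results from live nodes count toward the overall status.
function computeFinalSuccess(results: Map<string, NodeResult>, live: Set<string>): boolean {
  let finalSuccess = true;
  results.forEach((result, nodeId) => {
    if (live.has(nodeId) && !result.success) finalSuccess = false;
  });
  return finalSuccess;
}
```

In this model a failed entry for a node behind a condition that evaluated to false is ignored, so one dead branch no longer turns the whole run into "Error".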
The Workflow DevKit's durability layer can throw errors after
withStepLogging has already recorded a step as successful. Previously
only "exceeded max retries" errors were reconciled (KEEP-1541). This
adds a second pass (reconcileSdkFailures) that catches any remaining
failed node whose step was recorded as successful, covering SDK errors
with different messages that surface during parallel/branching execution
(event log corruption, state replay mismatches, unexpected event types).
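
A simplified sketch of what such a second reconciliation pass might look like. The function name comes from the commit message, but the signature, the `NodeOutcome` shape, and the use of plain maps are assumptions made for illustration.

```typescript
interface NodeOutcome {
  success: boolean;
  error?: string;
}

type StepLogStatus = "running" | "success" | "error";

// Flips any failed node whose step log already recorded success,
// regardless of the error message. Returns the IDs that were reconciled.
function reconcileSdkFailures(
  outcomes: Map<string, NodeOutcome>,
  stepLogs: Map<string, StepLogStatus>,
): string[] {
  const reconciled: string[] = [];
  outcomes.forEach((outcome, nodeId) => {
    if (!outcome.success && stepLogs.get(nodeId) === "success") {
      outcomes.set(nodeId, { success: true });
      reconciled.push(nodeId);
    }
  });
  return reconciled;
}
```

The key design point is that the pass trusts the recorded step log over a later SDK error, so it does not need to enumerate individual error messages.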
…lse-error-status

fix: KEEP-1512 condition node branching false error status
…retries-all-steps

fix: disable SDK retries on all web3 steps and match error formats
Cancelled step logs were incorrectly marked as "error". Added
"cancelled" to the log status type union to accurately reflect
user-initiated stops.
Update type annotations in workflow-runs, workflow-store, api-client,
and template-helpers to accept "cancelled" for step log status.
Cancelled runs were previously grouped under error. Now they appear as
their own status with orange styling in the time series chart, runs
table, and status filter dropdown.
feat: Add Stop mode to Run button to be able to cancel runs (workflows with Manual Trigger only)
eskp requested review from a team, OleksandrUA, joelorzet and suisuss, and removed the request for a team on March 5, 2026 22:23
…ility

ci: restructure e2e test workflows and add admin test API
eskp merged commit e81d515 into prod on Mar 5, 2026
16 of 29 checks passed
3 participants