To Prod by eskp · Pull Request #503 · techops-services/keeperhub

eskp · 2026-03-05T22:23:58Z

No description provided.

The Workflow DevKit runtime has no abort mechanism, so after the cancel endpoint sets status to "cancelled" the runtime continues executing steps and writing "success" logs to the DB. This causes steps to flash green and node borders to stick on their last color.

Add three layers of defense:
- Server-side guards in workflow-logging.ts: logStepStartDb, logStepCompleteDb, updateCurrentStep, and incrementCompletedSteps all bail out when the execution is in a terminal state (cancelled/success/error)
- Cancel endpoint cleanup: mark any in-flight "running" step logs as "error" and protect the internal PATCH route from overwriting a cancelled execution
- Client-side fixes: Runs panel does one final log refresh when an execution transitions to terminal then stops polling it; toolbar resets nodes to idle on cancel; new runsRefreshTriggerAtom gives instant Run row appearance after clicking Run instead of waiting for the 2s poll
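The server-side guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in workflow-logging.ts: the in-memory `executions` map stands in for the database, and the `ExecutionStatus` members other than cancelled/success/error are assumptions.

```typescript
// Hypothetical status type; only cancelled/success/error are named as terminal in the PR.
type ExecutionStatus = "pending" | "running" | "success" | "error" | "cancelled";

const TERMINAL_STATUSES: ExecutionStatus[] = ["cancelled", "success", "error"];

function isTerminal(status: ExecutionStatus): boolean {
  return TERMINAL_STATUSES.indexOf(status) !== -1;
}

// In-memory stand-in for the executions table (assumption for this sketch).
const executions = new Map<string, { status: ExecutionStatus; completedSteps: number }>();

// Guarded write: does nothing and returns false once the execution is terminal,
// so a runtime that keeps running after cancel cannot overwrite the final state.
function incrementCompletedSteps(executionId: string): boolean {
  const execution = executions.get(executionId);
  if (!execution || isTerminal(execution.status)) {
    return false;
  }
  execution.completedSteps += 1;
  return true;
}
```

The same early-return check would apply to each of the other logging functions listed above.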

- Rename e2e-tests.yml to e2e-tests-local.yml with explicit label gating
- Remove workflow_run trigger that caused phantom CI runs on every PR push
- Add e2e-vitest-remote and e2e-playwright-remote to deploy-pr-environment.yaml gated by run-e2e-tests-pr-deploy label
- Add e2e-playwright-remote to deploy-keeperhub.yaml for post-deploy verification
- Vitest remote on PR envs uses kubectl port-forward to CloudNativePG and LocalStack
- No vitest-remote on staging/prod (direct DB writes risk corrupting live data)
- Add docs/testing/README.md with workflow architecture and design decisions

Check github.event.label.name to distinguish which label was just added. Adding run-e2e-tests-pr-deploy no longer re-deploys the PR environment. Adding unrelated labels no longer re-runs local e2e tests. On synchronize events (new commits), jobs re-run if their labels are present.

Add TEST_API_KEY-gated admin endpoints so Playwright tests can retrieve OTP codes and invitation IDs from deployed environments without direct DB access. This removes the --grep-invert exclusions for invitation and wallet tests on remote runs.
- keeperhub/lib/admin-auth.ts: timing-safe Bearer auth + @techops.services validation
- /api/admin/test/otp: OTP lookup via Drizzle
- /api/admin/test/invitation: invitation ID lookup via Drizzle
- Test utils auto-switch between API (remote) and direct DB (ephemeral)
- Preflight checks in global-setup.ts validate env vars before tests run
- Rename e2e-tests-local.yml -> e2e-tests-ephemeral.yml (jobs: ephemeral)
- Add TEST_API_KEY to deploy values (SSM) and workflow env

- Unit tests for authenticateAdmin and validateTestEmail (9 cases)
- Integration tests for GET /api/admin/test/otp (10 cases)
- Integration tests for GET /api/admin/test/invitation (7 cases)

Covers auth rejection, email domain restriction, DB success/empty/error paths, and OTP value parsing.

Replace workflow_run triggers with workflow_call reusable workflows coordinated by orchestrator files. This associates workflow runs with their source branch in the GitHub Actions UI.
- Add ci-pipeline.yml orchestrator (e2e-tests -> deploy)
- Add release-pipeline.yml orchestrator (release -> docs-sync)
- Convert e2e-tests-ephemeral, deploy-keeperhub, release, docs-sync to reusable workflows (workflow_call)
- Pass caller_event input to work around github.event_name always being 'workflow_call' in reusable workflows
- Simplify branch/SHA expressions by removing workflow_run fallbacks
- Add type headers (orchestrator/reusable) to all pipeline workflows

- Rename "Local" context to "Ephemeral" to match naming convention
- Fix mermaid diagram: replace non-existent check-e2e-label with check-labels
- Update e2e-tests-ephemeral.yml trigger to reflect workflow_call pattern
- Correct remote vitest test count from 114+ to ~130
- Add composite action (setup-node-pnpm) documentation to workflow docs

…ignals

Add data attributes to app components for test automation:
- workflow-canvas: data-ready for canvas load state
- org-switcher: data-state for switching/loading/ready
- accept-invite: data-page-state for hydration state

Replace waitForTimeout/networkidle with element assertions:
- auth.setup: remove 2s sleeps, use domcontentloaded + org-switcher wait
- auth utils: replace networkidle with org-switcher visibility
- workflow utils: use data-ready, waitForURL, element assertions
- invitations: replace retry loop with data-page-state wait
- workflow.test/schedule-trigger.test: remove all waitForTimeout calls
- organization-wallet.test: replace toast race with element assertion

Stabilize playwright config:
- fullyParallel: false, workers: 1 (serial to avoid shared-state conflicts)
- retries: 2 (handles environmental flakiness)
- reporter: github + html in CI, list locally

- Unskip analytics-gas, scheduled-workflow, web3-balance, para-wallet tests
- Remove stop-execution.test.ts and ORG-4 placeholder (unimplemented UI)
- Fix Para Wallet "Create Wallet" ambiguous selector (data-slot scoping)
- Fix analytics-gas scrollIntoView race (wait for attachment first)
- Remove CI sharding from e2e-tests-ephemeral.yml (serial execution)
- Move regex literal to top-level scope (biome lint fix)
- Use consistent button[role="combobox"] selector in signIn()
- Update docs with CI execution model and stability decisions

Use better-auth customRules to bypass rate limiting when requests include a valid X-Test-API-Key header. Playwright config sends this header automatically when TEST_API_KEY env var is set. This keeps rate limiting active for real users while allowing E2E tests to run without hitting limits in PR, staging, and prod environments.

Condition-based branching workflows (e.g. parallel "Balance < 1 ETH" and "Balance >= 1 ETH") incorrectly show "Error" status when one branch is dead. Root cause: finalSuccess treats every result entry equally, so a condition that fails because it references an unexecuted dead-branch node poisons the entire run.

Three fixes:
- Track condition routing decisions (conditionDecisions map) and exclude nodes on not-taken branches from the finalSuccess calculation.
- Harden replaceTemplateVariable: when a referenced node exists in the graph but was never executed (dead branch), return undefined instead of throwing, so the condition evaluates gracefully to false.
- Add diagnostic logging when finalSuccess is false in a branching workflow to aid production debugging.
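One way to picture the first fix: walk the graph from its entry nodes, skip edges on not-taken condition branches, and only count results from nodes that remain reachable. The sketch below is illustrative only. The `Edge` shape, the `branch` label, and both function names are assumptions, and the real executor may derive dead branches differently.

```typescript
// Hypothetical graph and result shapes for this sketch.
interface Edge { source: string; target: string; branch?: "true" | "false" }
interface NodeResult { success: boolean }

// Collect nodes reachable from entry nodes without crossing a not-taken condition edge.
function collectLiveNodes(
  nodeIds: string[],
  edges: Edge[],
  conditionDecisions: Map<string, boolean>,
): Set<string> {
  const targets = edges.map((edge) => edge.target);
  const queue = nodeIds.filter((id) => targets.indexOf(id) === -1);
  const live = new Set<string>(queue);
  while (queue.length > 0) {
    const current = queue.shift() as string;
    const decision = conditionDecisions.get(current);
    edges.forEach((edge) => {
      if (edge.source !== current || live.has(edge.target)) return;
      const notTaken =
        edge.branch !== undefined &&
        decision !== undefined &&
        edge.branch !== (decision ? "true" : "false");
      if (notTaken) return;
      live.add(edge.target);
      queue.push(edge.target);
    });
  }
  return live;
}

// Only failures on live nodes make the run fail; dead-branch entries are ignored.
function computeFinalSuccess(results: Map<string, NodeResult>, live: Set<string>): boolean {
  let success = true;
  results.forEach((result, nodeId) => {
    if (live.has(nodeId) && !result.success) success = false;
  });
  return success;
}
```

With two parallel conditions where only one evaluates to true, a failed entry recorded for the action behind the false condition no longer affects the overall status.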

The Workflow DevKit's durability layer can throw errors after withStepLogging has already recorded a step as successful. Previously only "exceeded max retries" errors were reconciled (KEEP-1541). This adds a second pass (reconcileSdkFailures) that catches any remaining failed node whose step was recorded as successful, covering SDK errors with different messages that surface during parallel/branching execution (event log corruption, state replay mismatches, unexpected event types).
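A rough sketch of what such a second reconciliation pass could look like, assuming node results and step log statuses are both keyed by node ID. The data shapes and the return value here are hypothetical; only the function name reconcileSdkFailures comes from the commit message.

```typescript
// Hypothetical shapes; the actual result and step log types are not shown in this PR.
interface NodeResult { success: boolean; error?: string }
type StepLogStatus = "running" | "success" | "error" | "cancelled";

// Flip any failed node whose step was already logged as successful back to success,
// regardless of the error message, and return the IDs that were reconciled.
function reconcileSdkFailures(
  results: Map<string, NodeResult>,
  stepLogStatus: Map<string, StepLogStatus>,
): string[] {
  const reconciled: string[] = [];
  results.forEach((result, nodeId) => {
    if (!result.success && stepLogStatus.get(nodeId) === "success") {
      results.set(nodeId, { success: true });
      reconciled.push(nodeId);
    }
  });
  return reconciled;
}
```

Keying the decision on the recorded step status rather than on the error text is what lets this pass cover SDK errors with differing messages.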

eskp and others added 30 commits March 4, 2026 14:15

docs: KEEP-1351 document test data seeding, wallet, Sepolia, secrets,…

8866825

… and cleanup

fix: KEEP-1351 correct workflow_run name to match renamed workflow

990231d

deploy-keeperhub.yaml referenced "E2E Tests Local" but the workflow was renamed to "E2E Tests Ephemeral". workflow_run triggers match by name, so the deploy pipeline would never trigger after ephemeral tests pass.

docs: KEEP-1351 fix stale local references to use ephemeral naming

5e54e01

Mermaid diagrams, workflow files table, and label reference still used the old "local" naming. Updated to match the actual workflow file and job names (ephemeral).

refactor: KEEP-1351 extract duplicate getAdminFetchHeaders to shared …

03ec125

…module The same function was duplicated in auth.ts and invitations.ts. Extracted to admin-fetch.ts and imported from both.

fix: KEEP-1351 eliminate key length timing leak in secureCompare

a2235bc

The early return on length mismatch allowed an attacker to infer key length via timing. Hash both values with SHA-256 first so timingSafeEqual always compares fixed-length inputs.
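A minimal sketch of the hashing approach described above, using Node's built-in crypto module. The function name matches the commit title, but the signature and the UTF-8 encoding choice are assumptions rather than the code in admin-auth.ts.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both inputs first so timingSafeEqual always receives two 32-byte buffers.
// This removes the need for an early return on length mismatch.
function secureCompare(provided: string, expected: string): boolean {
  const providedDigest = createHash("sha256").update(provided, "utf8").digest();
  const expectedDigest = createHash("sha256").update(expected, "utf8").digest();
  return timingSafeEqual(providedDigest, expectedDigest);
}
```

Without the hashing step, timingSafeEqual throws when its inputs differ in length, which is why a naive implementation tends to add the length check that leaks information.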

fix: KEEP-1351 return 403 for non-techops email domain restriction

f593ac1

The @techops.services email check is an authorization constraint, not a validation error. 403 Forbidden is semantically correct.
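As an illustration of the status choice, a domain check could return a 403 result along these lines. The return shape, the error text, and the case-insensitive comparison are assumptions; only the validateTestEmail name and the @techops.services restriction come from this PR.

```typescript
const ALLOWED_DOMAIN = "@techops.services";

// Hypothetical result shape for this sketch.
type EmailCheck = { ok: true } | { ok: false; status: 403; error: string };

// The caller is authenticated but not permitted to query this address,
// so the rejection is reported as 403 Forbidden rather than a validation error.
function validateTestEmail(email: string): EmailCheck {
  const normalized = email.trim().toLowerCase();
  const hasLocalPart = normalized.length > ALLOWED_DOMAIN.length;
  if (hasLocalPart && normalized.endsWith(ALLOWED_DOMAIN)) {
    return { ok: true };
  }
  return { ok: false, status: 403, error: "Forbidden" };
}
```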

chore: KEEP-1351 standardize checkout action to v6 in deploy-pr-envir…

d30ec0d

…onment The new remote test jobs used actions/checkout@v4 while the rest of the file uses v6. Standardized to v6.

refactor: extract isRemoteMode to shared util

c1c925b

fix: correct stale comment in e2e-tests-ephemeral workflow

90de3c5

fix: add CF Access headers to remote e2e job in deploy workflow

a0282b5

style: fix lint errors in admin test files

bcf08f7

fix(ci): KEEP-1351 add staging environment to playwright remote job

5ccb9ca

Allows the e2e-playwright-remote job to access environment-scoped secrets (TEST_API_KEY) needed for admin test API authentication.

fix(ci): KEEP-1351 seed test users in remote playwright via DB port-f…

0c20828

…orward Add AWS/kubectl/DB port-forward steps to the e2e-playwright-remote job so persistent test users are seeded before tests run. Update global-setup to run seed in remote mode when DATABASE_URL is available.

fix: KEEP-1351 make signOut throw when user menu is not visible

4dbedf5

signOut() previously did nothing if the user menu wasn't visible, allowing tests to silently proceed while still logged in. Now uses expect assertions that throw on timeout, failing the test immediately.

fix: KEEP-1351 add exponential backoff to OTP API polling

122a996

Linear 500ms retries flake under load. Exponential backoff (500ms -> 1s -> 2s -> 4s cap) with 8 retries gives the server time to process OTP generation asynchronously.
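The delay schedule above can be sketched as a small helper plus a polling loop. Function names and the `fetchOnce` callback are hypothetical, and treating "8 retries" as eight total attempts is an assumption about the original test utility.

```typescript
// Delay before the next attempt: 500ms, 1s, 2s, then capped at 4s.
function backoffDelayMs(attempt: number, baseMs: number = 500, capMs: number = 4000): number {
  return Math.min(baseMs * Math.pow(2, attempt), capMs);
}

// Poll until fetchOnce returns a non-null value or attempts run out.
async function pollWithBackoff<T>(
  fetchOnce: () => Promise<T | null>,
  maxAttempts: number = 8,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fetchOnce();
    if (result !== null) {
      return result;
    }
    // No point sleeping after the final attempt.
    if (attempt < maxAttempts - 1) {
      await sleep(backoffDelayMs(attempt));
    }
  }
  throw new Error("No result after " + maxAttempts + " attempts");
}
```

Passing `sleep` as a parameter keeps the loop testable without real timers.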

fix: KEEP-1351 add exponential backoff to invitation API polling

5b7b488

Same fix as OTP polling: exponential backoff (500ms -> 4s cap) instead of linear 500ms retries.

fix: KEEP-1351 return generic error message in admin test API 500s

bad1f2e

Stop forwarding error.message to the client in 500 responses. The detail is still logged server-side via console.error.
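A framework-neutral sketch of the pattern: log the full error on the server and hand the client a fixed message. The `JsonResponse` shape and the helper name are hypothetical; the "Internal server error" text matches the follow-up test commit.

```typescript
// Hypothetical response shape; the real routes presumably return framework responses.
type JsonResponse = { status: number; body: { error: string } };

// Keep the detail in server logs only; never echo error.message to the client.
function toInternalErrorResponse(
  error: unknown,
  log: (...args: unknown[]) => void = console.error,
): JsonResponse {
  log("Admin test API error:", error);
  return { status: 500, body: { error: "Internal server error" } };
}
```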

test: KEEP-1351 fix assertions to match sanitized error responses

6f600cb

Admin routes now return "Internal server error" instead of forwarding error.message. Update test expectations to match.

chore: KEEP-1351 standardize checkout action to v6 in e2e-tests-ephem…

0ef8042

…eral

docs: KEEP-1351 fix stale workflow_run reference in testing docs

5b3ccf1

deploy-keeperhub.yaml is now a reusable workflow called via workflow_call from ci-pipeline.yml, not triggered by workflow_run.

refactor: KEEP-1351 extract node/pnpm/playwright setup to composite a…

c3d0e88

…ction Deduplicate ~40 lines of setup steps repeated across 6 jobs. New .github/actions/setup-node-pnpm/action.yml with two boolean inputs: install-playwright and discover-plugins.

suisuss and others added 15 commits March 5, 2026 19:26

fix: KEEP-1351 disable auth rate limiting in PR environments

20c80b7

The deployed PR environment runs NODE_ENV=development, not test/CI, so rate limiting was still active. Add DISABLE_AUTH_RATE_LIMIT env var to PR environment values and check it in the auth config.

fix: KEEP-1351 add rate limit bypass header to vitest E2E utils

e20af8d

Add testFetch() and getTestHeaders() to vitest E2E utils for future tests that hit auth endpoints. Also add X-Test-API-Key to Playwright admin-fetch headers.

fix: KEEP-1351 fix lint and type errors in rate limit bypass

60b9dc1

Fix customRules return type (return currentRule instead of undefined), remove unused biome-ignore suppression, drop unnecessary async.

ci: KEEP-1351 temporarily disable all E2E test jobs

b2608bc

Remote and ephemeral E2E test jobs are disabled (if: false) across deploy-keeperhub, deploy-pr-environment, and e2e-tests-ephemeral workflows while auth/rate-limit infrastructure is being stabilised.

ci: KEEP-1351 gate E2E tests behind repo variables

e52d337

Remote tests gated by ENABLE_E2E_REMOTE_TESTS, ephemeral tests by ENABLE_E2E_EPHEMERAL_TESTS. Both are GitHub repository variables. Currently ephemeral=true, remote=false.

Merge pull request #501 from techops-services/fix/condition-branch-fa…

dfda2fe

…lse-error-status fix: KEEP-1512 condition node branching false error status

Merge branch 'staging' into feat/KEEP-1545-fix-max-retries-all-steps

9fd550e

Merge pull request #498 from techops-services/feat/KEEP-1545-fix-max-…

9b935fd

…retries-all-steps fix: disable SDK retries on all web3 steps and match error formats

fix: Use cancelled status for in-flight step logs on execution stop

8acf2da

Cancelled step logs were incorrectly marked as "error". Added "cancelled" to the log status type union to accurately reflect user-initiated stops.
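In type terms, the change amounts to widening the status union and mapping in-flight logs to the new member on stop. A small sketch follows; the `StepLog` shape and the function name are assumptions, and the union lists only statuses mentioned in this PR.

```typescript
// "cancelled" is the newly added member; the others appear elsewhere in this PR.
type StepLogStatus = "running" | "success" | "error" | "cancelled";

// Hypothetical log shape for this sketch.
interface StepLog { nodeId: string; status: StepLogStatus }

// On user-initiated stop, in-flight logs become "cancelled" instead of "error".
// Logs that already finished keep their recorded status.
function cancelInFlightLogs(logs: StepLog[]): StepLog[] {
  return logs.map((log): StepLog =>
    log.status === "running" ? { nodeId: log.nodeId, status: "cancelled" } : log,
  );
}
```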

fix: Add cancelled to step log status types across codebase

0b03dbc

Update type annotations in workflow-runs, workflow-store, api-client, and template-helpers to accept "cancelled" for step log status.

feat: Show cancelled as distinct status in analytics dashboard

5245ca5

Cancelled runs were previously grouped under error. Now they appear as their own status with orange styling in the time series chart, runs table, and status filter dropdown.

Merge pull request #482 from techops-services/feature/stop-execution-fix

f99a3af

feat: Add Stop mode to Run button to be able to cancel runs (workflows with Manual Trigger only)

eskp requested review from a team, OleksandrUA, joelorzet and suisuss and removed request for a team March 5, 2026 22:23

Merge pull request #472 from techops-services/feat/KEEP-1351-e2e-stab…

2646a3d

…ility ci: restructure e2e test workflows and add admin test API

joelorzet approved these changes Mar 5, 2026

suisuss temporarily deployed to staging March 5, 2026 22:47 — with GitHub Actions Inactive

suisuss had a problem deploying to staging March 5, 2026 23:02 — with GitHub Actions Failure

suisuss temporarily deployed to staging March 5, 2026 23:09 — with GitHub Actions Inactive

suisuss had a problem deploying to staging March 5, 2026 23:19 — with GitHub Actions Failure

suisuss temporarily deployed to staging March 5, 2026 23:28 — with GitHub Actions Inactive

suisuss temporarily deployed to staging March 5, 2026 23:35 — with GitHub Actions Inactive

eskp merged commit e81d515 into prod Mar 5, 2026
16 of 29 checks passed

suisuss temporarily deployed to staging March 5, 2026 23:44 — with GitHub Actions Inactive

To Prod #503
eskp merged 54 commits into prod from staging

eskp commented Mar 5, 2026

3 participants
