Skip to content

ci: shell-quote env values in system-integration workflow#536

Merged
hbrodin merged 2 commits intomainfrom
ci/fix-system-integration-env-quoting
Apr 27, 2026
Merged

ci: shell-quote env values in system-integration workflow#536
hbrodin merged 2 commits intomainfrom
ci/fix-system-integration-env-quoting

Conversation

@hbrodin
Copy link
Copy Markdown
Collaborator

@hbrodin hbrodin commented Apr 27, 2026

Summary

The weekly System Integration Tests workflow has been failing on every Sunday cron since 2026-02-01 — every run we have history for. The job dies in the very first deploy step:

./env: line 37: nabHtGdmw4...: command not found
make[1]: *** [Makefile:26: up] Error 127
make: *** [Makefile:95: deploy] Error 2

Root cause

Two compounding bugs in the Configure env file for minikube step:

  1. Deletion regex never matched. The step ran sed -i "/^KEY=/d" env for each secret key, but ci-env.template has every line prefixed with export (e.g. export OTEL_TOKEN=replace-me). The anchor ^OTEL_TOKEN= never matches export OTEL_TOKEN=, so the placeholder lines stayed in the file alongside the appended real values. For most secrets (plain alphanumeric) this was harmless because the second assignment shadowed the first at source time.

  2. Appended values were not shell-quoted. deployment/crs-architecture.sh:16 does source ./env, so each line is parsed as bash. The step appended values via echo "OTEL_TOKEN=${OTEL_TK}", with no quoting. The OTEL_TOKEN secret is a Bearer <token> header — the space causes bash to parse OTEL_TOKEN=Bearer as the assignment and then attempt to execute <token> as a command. Hence command not found and exit 127.

The same gap was a code-execution sink: a secret containing $(...) or backticks would have run on the runner at source time. Reproduction confirmed OTEL_TK="$(touch /tmp/PWNED)" produced the file under the previous code.

Fix

Two changes that reinforce each other:

  • .github/ci-env.template — drop the entire # Secrets/Env replaced block. Those export KEY=replace-me lines only existed because the pre-rewrite approach used sed -i "s|KEY=.*|KEY=...|", which needed substitution targets. The current approach doesn't, so they're vestigial. Removing them eliminates the whole delete-then-rewrite dance and the regex bug class with it.
  • .github/workflows/system-integration.yml — write each secret with printf 'export %s=%q\n'. %q produces a value that bash can re-parse as the literal string, regardless of whitespace, quotes, $(), backticks, or newlines.

After the change, the step is:

write_var() {
  printf 'export %s=%q\n' "$1" "$2" >> ./env
}
cp ../.github/ci-env.template ./env
write_var BUTTERCUP_NAMESPACE  "$BUTTERCUP_NS"
write_var OPENAI_API_KEY       "$OPENAI_KEY"
# ... etc

GHCR_AUTH is no longer defined anywhere, so [ -n "$GHCR_AUTH" ] in crs-architecture.sh:63 takes the warning-only else branch as intended (Docker pulls from a public ghcr.io image set in this CI configuration, no authentication required).

Verification

  • actionlint (with embedded shellcheck) clean on both changed files.
  • Reproduced the original failure shape under the new code: env file generates with shell-escaped values, source ./env succeeds, value reads back intact, no leftover replace-me lines, GHCR_AUTH correctly unset.
  • Reproduced the same input under the previous code: source fails with unexpected EOF and replace-me placeholders remain in the file (matches production failure pattern).
  • Reproduced the injection: under the previous code, a value containing $(touch ...) executed the command on source; under the new code it is escaped and stored as a literal string.

Test plan

  • Merge.
  • Trigger the workflow via workflow_dispatch to confirm make deploy proceeds past the env-source step before relying on the next Sunday cron.
  • Confirm the next scheduled Sunday run reaches at least the fuzzer-build step (subsequent steps depend on infrastructure beyond this fix).

🤖 Generated with Claude Code

Two bugs in the env file step prevented the weekly System Integration
Tests from ever passing since #427 (2026-01-26):

1. The deletion regex `/^KEY=/` did not match the template's
   `export KEY=replace-me` lines, leaving stale `replace-me` placeholders
   in the file alongside the appended real values.
2. Appended values were not shell-quoted, so `source ./env` in
   deployment/crs-architecture.sh treated whitespace and metacharacters
   as shell syntax. The `OTEL_TOKEN` secret (a `Bearer <token>` header)
   tripped this with `command not found` on the token component.

The same gap was a code-execution sink: a secret containing `$(...)` or
backticks would execute on the runner at source time.

Fix: strip with `(export[[:space:]]+)?` so template lines are actually
removed, and write with `printf 'export %s=%q\n'` so every value is
shell-escaped — defending against both whitespace and injection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The placeholder lines (`export KEY=replace-me`) only existed because the
pre-rewrite step substituted into them with `sed -i "s|KEY=.*|...|"`.
The new `printf %q` write path doesn't need them, so removing them at
the source eliminates the entire delete-then-rewrite dance.

Drops `strip_var` and the for-loop. The workflow step is now: copy the
static template, then append shell-quoted assignments for every secret
the env: block injects. GHCR_AUTH is no longer defined at all, so the
`[ -n "$GHCR_AUTH" ]` warning branch in crs-architecture.sh:63 fires as
intended.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hbrodin hbrodin merged commit a3d805c into main Apr 27, 2026
21 checks passed
@hbrodin hbrodin deleted the ci/fix-system-integration-env-quoting branch April 27, 2026 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants