fix(ci): fail-fast pre-flight + per-job permissions on release publish #255
Merged
Release pipeline #98 (v0.5.99) burned ~32 min building binaries on all three platforms, then died in <1 s at the 'Create GitHub Release' step with the cryptic:

    HTTP 403 — Resource not accessible by integration

Root cause was a repo-level setting flip: 'Settings → Actions → General → Workflow permissions' had been switched from 'Read and write' to 'Read-only', which clamps every job's GITHUB_TOKEN to read scope at runtime regardless of what the workflow file declares. v0.5.96 (the previous successful release ~16 h earlier) ran with the same workflow file and succeeded, confirming the file was never the problem. The audit log is not retrievable (free-tier org), so the actor/timestamp of the flip is unrecoverable.

This commit prevents the next ~30 min of silent build time:

1. Pre-flight permissions probe in 'release-preparation'. Creates a draft release with a throwaway tag (run_id+attempt) and immediately deletes it on success. Exercises the EXACT REST path that softprops/action-gh-release uses 30+ min later, so a permissions clamp surfaces in ~1 s with a precise error message pointing at the toggle that needs flipping (repo level, with an org-level fallback note). Draft releases do not create git tags, so failed cleanup leaks at worst an invisible draft row, never an orphan tag.

2. Explicit per-job 'permissions:' block on 'create-github-release' pinning 'contents: write', 'id-token: write', 'attestations: write'. Documents the scope needs at the point of use rather than relying on inheritance from the top-level block ~520 lines up. Does NOT change runtime behaviour by itself (the repo-level clamp still wins), but pairs with the pre-flight to make the failure mode self-documenting from the YAML alone.

Why a synthetic-release probe rather than a direct '/repos/.../actions/permissions/workflow' API call: that endpoint requires the 'administration: read' scope, which neither this workflow nor the default GITHUB_TOKEN carries. Widening to add it would expand the permission surface; the create-draft probe stays inside 'contents: write' which the job already declares.

Local validation: actionlint clean on both touched workflow files; lint-fast + lint-pre-push will gate the push.
What
Release pipeline #98 (v0.5.99) built binaries on all three platforms for ~32 minutes and then died in <1 second at the `📦 Create GitHub Release` step with:
```
HTTP 403 — Resource not accessible by integration
```
Root cause was a repo-level setting flip: `Settings → Actions → General → Workflow permissions` had been switched from "Read and write" to "Read-only". That toggle clamps every job's `GITHUB_TOKEN` to read scope at runtime regardless of what the workflow file declares — so the workflow's top-level `permissions: contents: write` was silently downgraded to read. v0.5.96 (the previous successful release ~16 h earlier) ran with the same workflow file and succeeded, confirming the file was never the problem.
The org's audit log is not retrievable (Free-tier org — `gh api /orgs/skyllc-ai/audit-log` returns 404), so the actor/timestamp of the flip is unrecoverable.
Why this PR is needed even though the immediate symptom can be fixed by re-flipping the toggle
A repo-level toggle that bypasses every safeguard in YAML is a perfect silent-rot vector: the next time it's flipped (intentionally or accidentally — by you, a co-maintainer, an org admin, or a GitHub-side policy nudge), the next release will silently burn ~30 minutes of build time before dying at the publish step. This PR makes that scenario fail in ~1 second with a precise error message.
The two belt-and-suspenders changes
1. Pre-flight permissions probe in `release-preparation`
Creates a draft release with a throwaway tag (`preflight-permcheck-run<run_id>-attempt<run_attempt>`) and immediately deletes it on success. Exercises the EXACT REST path that `softprops/action-gh-release` uses 30+ minutes later in `create-github-release`, so a permissions clamp surfaces here in ~1 s with an actionable error message pointing at the toggle that needs flipping (repo level, with an org-level fallback note for free-tier orgs where the cascade can come from the org).
```yaml
shell: bash
env:
  GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
  set -euo pipefail
  TEST_TAG="preflight-permcheck-run${{ github.run_id }}-attempt${{ github.run_attempt }}"
  if ! gh release create "$TEST_TAG" --repo "${{ github.repository }}" --draft …; then
    # ::error annotation + heredoc with full remediation steps
    exit 1
  fi
  gh release delete "$TEST_TAG" --repo "${{ github.repository }}" --yes || \
    echo "::warning::orphan is harmless"
```
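The failure branch in the excerpt above is elided to a comment. As a rough sketch of what an `::error` annotation plus remediation heredoc could look like: the function name `emit_permission_error` and the message wording are assumptions for illustration, not the text this PR actually ships.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints a GitHub Actions error annotation followed by
# remediation steps. Wording is illustrative, not the PR's actual message.
emit_permission_error() {
  local repo="$1"
  # Workflow command: the runner renders this line as an error annotation.
  echo "::error title=Release pre-flight failed::GITHUB_TOKEN cannot create releases in ${repo}"
  cat <<EOF
The token appears to be limited to read scope by a repository or organization setting.
To fix:
  1. Open https://github.com/${repo}/settings/actions
  2. Under "Workflow permissions", select "Read and write permissions"
  3. If that option is greyed out, change the same setting at the org level first
  4. Re-run this workflow
EOF
}

# GITHUB_REPOSITORY is set by the runner; the fallback is a placeholder.
emit_permission_error "${GITHUB_REPOSITORY:-owner/repo}"
```

In the real step, a helper like this would run just before `exit 1`, so the annotation appears on the run summary and the full steps appear in the job log.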
Why a synthetic-release probe rather than a direct `/repos/.../actions/permissions/workflow` API call: that endpoint requires the `administration: read` scope, which neither this workflow nor the default `GITHUB_TOKEN` carries. Widening to add it would expand the permission surface. The create-draft probe stays inside `contents: write` which the job already declares (no new scope grants).
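For contrast, a maintainer with admin rights can read the setting directly from a local shell, outside the workflow. The sketch below assumes a `gh` login with admin access to the repository; both function names are hypothetical, and the verdict strings simply restate this PR's diagnosis.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: maps the endpoint's default_workflow_permissions value
# ("read" or "write") to a short verdict.
describe_default_permissions() {
  case "$1" in
    write) echo "ok: default is read and write" ;;
    read)  echo "read-only: the state this PR diagnoses as the cause of the 403" ;;
    *)     echo "unknown: unexpected value '$1'" ;;
  esac
}

# Hypothetical wrapper: needs a token with admin rights on the repository,
# which the workflow's GITHUB_TOKEN does not have.
check_repo_default_permissions() {
  local repo="$1" value
  value="$(gh api "repos/${repo}/actions/permissions/workflow" \
    --jq '.default_workflow_permissions')"
  describe_default_permissions "$value"
}

# Usage from a maintainer's shell (not from the workflow):
#   check_repo_default_permissions owner/repo
```

This is useful for confirming the toggle after a pre-flight failure, but it is not a substitute for the in-workflow probe, which needs no extra scope.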
Cleanup safety: Draft releases do not create the underlying git tag (only published releases do), so a failed cleanup leaks at worst an invisible draft row — never an orphan tag. The throwaway tag name includes both `run_id` and `run_attempt` so concurrent or retried invocations cannot collide.
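The collision argument can be illustrated with a small sketch. The function name `preflight_tag` is hypothetical; inside a workflow step the two arguments would come from the runner-provided `GITHUB_RUN_ID` and `GITHUB_RUN_ATTEMPT` environment variables, and the run ID below is a made-up value.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: builds the throwaway tag name from a run ID and an
# attempt number, following the naming pattern described above.
preflight_tag() {
  local run_id="$1" run_attempt="$2"
  echo "preflight-permcheck-run${run_id}-attempt${run_attempt}"
}

# A first run and its re-run share a run ID but differ in attempt number,
# so they never produce the same tag.
first="$(preflight_tag 1234567890 1)"
retry="$(preflight_tag 1234567890 2)"
echo "$first"
echo "$retry"

# In a workflow step:
#   preflight_tag "$GITHUB_RUN_ID" "$GITHUB_RUN_ATTEMPT"
```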
2. Explicit per-job `permissions:` block on `create-github-release`
Pins `contents: write`, `id-token: write`, `attestations: write` at the point of use rather than relying on inheritance from the workflow-level block ~520 lines up.
Does NOT change runtime behaviour by itself — the repo-level clamp still wins. But it pairs with the pre-flight to make the failure mode self-documenting from the YAML alone: a reader doesn't have to scroll 500 lines to learn which scopes this job needs.
```yaml
create-github-release:
  name: 📦 Create GitHub Release
  …
  permissions:
    contents: write      # action-gh-release: create v* tag + release row
    id-token: write      # SLSA build-provenance: OIDC for Sigstore Fulcio
    attestations: write  # SLSA build-provenance: post attestation to repo
```
Verification
What it does NOT do
Recommended next steps after merge
Regression history: pipeline #98 / v0.5.99. Related prior PRs in the same release-stability sweep: #251 (show-binary-sizes shell bug), #254 (macOS rustup proxy post-cache-restore).