fix(ci): fail-fast pre-flight + per-job permissions on release publish #255

Merged
githubrobbi merged 1 commit into main from fix/ci-release-preflight-token-permissions on May 15, 2026
@githubrobbi
Collaborator

What

Release pipeline #98 (v0.5.99) built binaries on all three platforms for ~32 minutes and then died in <1 second at the 📦 Create GitHub Release step with:

```
HTTP 403 — Resource not accessible by integration
```

Root cause was a repo-level setting flip: `Settings → Actions → General → Workflow permissions` had been switched from "Read and write" to "Read-only". That toggle clamps every job's `GITHUB_TOKEN` to read scope at runtime regardless of what the workflow file declares — so the workflow's top-level `permissions: contents: write` was silently downgraded to read. v0.5.96 (the previous successful release ~16 h earlier) ran with the same workflow file and succeeded, confirming the file was never the problem.

The org's audit log is not retrievable (Free-tier org — `gh api /orgs/skyllc-ai/audit-log` returns 404), so the actor/timestamp of the flip is unrecoverable.

Why this PR is needed even though the immediate symptom can be fixed by re-flipping the toggle

A repo-level toggle that bypasses every safeguard in YAML is a perfect silent-rot vector: the next time it's flipped (intentionally or accidentally — by you, a co-maintainer, an org admin, or a GitHub-side policy nudge), the next release will silently burn ~30 minutes of build time before dying at the publish step. This PR makes that scenario fail in ~1 second with a precise error message.

The two belt-and-suspenders changes

1. Pre-flight permissions probe in `release-preparation`

Creates a draft release with a throwaway tag (`preflight-permcheck-run<run_id>-attempt<run_attempt>`) and immediately deletes it on success. Exercises the EXACT REST path that `softprops/action-gh-release` uses 30+ minutes later in `create-github-release`, so a permissions clamp surfaces here in ~1 s with an actionable error message pointing at the toggle that needs flipping (repo level, with an org-level fallback note for free-tier orgs where the cascade can come from the org).

```yaml
- name: Pre-flight — verify Actions token can create releases
  shell: bash
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    set -euo pipefail
    TEST_TAG="preflight-permcheck-run${{ github.run_id }}-attempt${{ github.run_attempt }}"
    if ! gh release create "$TEST_TAG" --repo "${{ github.repository }}" --draft …; then
      # ::error annotation + heredoc with full remediation steps
      exit 1
    fi
    gh release delete "$TEST_TAG" --repo "${{ github.repository }}" --yes || \
      echo "::warning::orphan is harmless"
```

Why a synthetic-release probe rather than a direct `/repos/.../actions/permissions/workflow` API call: that endpoint requires the `administration: read` scope, which neither this workflow nor the default `GITHUB_TOKEN` carries. Widening to add it would expand the permission surface. The create-draft probe stays inside `contents: write` which the job already declares (no new scope grants).
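For comparison, the direct check that the probe avoids might look like the sketch below when a repo admin runs it from a workstation with an admin-capable `gh` login. The function names are hypothetical and not part of this PR; the only real pieces are the `gh api` call and the `default_workflow_permissions` field that the endpoint returns.

```shell
# Hypothetical helper: interpret the endpoint's default_workflow_permissions value.
describe_default_permission() {
  case "$1" in
    write) echo "ok: workflow permissions default to read and write" ;;
    read)  echo "clamped: workflow permissions default to read-only" ;;
    *)     echo "unknown value: $1"; return 1 ;;
  esac
}

# Hypothetical helper: query the repo-level setting for "owner/name".
# Needs admin-level auth, which is why the workflow itself does not call it.
query_default_permission() {
  gh api "repos/$1/actions/permissions/workflow" \
    --jq '.default_workflow_permissions'
}

# Usage (not executed here: needs network access and admin credentials):
#   describe_default_permission "$(query_default_permission OWNER/REPO)"
describe_default_permission read
```

This is useful as a manual diagnostic after the probe has failed, not as a replacement for the probe inside the workflow.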

Cleanup safety: Draft releases do not create the underlying git tag (only published releases do), so a failed cleanup leaks at worst an invisible draft row — never an orphan tag. The throwaway tag name includes both `run_id` and `run_attempt` so concurrent or retried invocations cannot collide.
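The collision argument can be checked with a small sketch of the naming scheme. `preflight_tag` is a hypothetical helper used only for illustration; the workflow itself interpolates `github.run_id` and `github.run_attempt` directly.

```shell
# Hypothetical helper (not part of the PR): build the throwaway tag name.
preflight_tag() {
  printf 'preflight-permcheck-run%s-attempt%s\n' "$1" "$2"
}

# GITHUB_RUN_ID is set by the Actions runner; the fallback value
# only keeps this sketch runnable on a workstation.
run_id="${GITHUB_RUN_ID:-1001}"
echo "first attempt: $(preflight_tag "$run_id" 1)"
echo "re-run:        $(preflight_tag "$run_id" 2)"
```

A re-run of the same workflow run keeps its `run_id` but increments `run_attempt`, so the two names differ even if the first attempt's draft was never cleaned up.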

2. Explicit per-job `permissions:` block on `create-github-release`

Pins `contents: write`, `id-token: write`, `attestations: write` at the point of use rather than relying on inheritance from the workflow-level block ~520 lines up.

Does NOT change runtime behaviour by itself — the repo-level clamp still wins. But it pairs with the pre-flight to make the failure mode self-documenting from the YAML alone: a reader doesn't have to scroll 500 lines to learn which scopes this job needs.

```yaml
create-github-release:
  name: 📦 Create GitHub Release

  permissions:
    contents: write      # action-gh-release: create v* tag + release row
    id-token: write      # SLSA build-provenance: OIDC for Sigstore Fulcio
    attestations: write  # SLSA build-provenance: post attestation to repo
```

Verification

  • `actionlint .github/workflows/release.yml .github/workflows/release-cache-warm.yml` — clean.
  • Local `lint-pre-push` gate (22 stages, including `workflow-drift`) — ✅ all green in 51s.
  • The probe itself will be exercised on the next release dispatch. On the current repo state (workflow permissions = "read"), it will fail with the precise error message it's designed to surface; once the repo-level toggle is flipped back to "write", subsequent runs will pass the probe in ~1 s.

What it does NOT do

  • It does not flip the repo-level workflow-permissions toggle (that's a repo-admin action that should be done deliberately, not as a code change).
  • It does not change behaviour on the existing failing run — that one is permanently failed and needs a fresh dispatch.
  • It does not widen the workflow's permission surface — no new `administration:read` scope, no new tokens.

Recommended next steps after merge

  1. Flip the toggle: `Settings → Actions → General → Workflow permissions → Read and write permissions → Save` (or via API as documented in the probe's failure message).
  2. Re-dispatch `🚀 UFFS Release Pipeline` for `v0.5.99` (no version bump needed — Cargo.toml is already at 0.5.99, no `v0.5.99` tag exists yet, and the release workflow creates the tag atomically).
  3. The pre-flight probe will pass in <2 s; the rest of the pipeline proceeds.
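For step 1, the API route could look roughly like the sketch below. The probe's actual failure message is not reproduced in this PR description, so treat this as an assumption about its shape and not as a copy of it. `set_default_permission` and the `APPLY` switch are hypothetical; the `PUT` endpoint and its `default_workflow_permissions` field are the documented REST API for this setting.

```shell
# Hypothetical helper: set the repo-level default for "owner/name".
# Prints the command by default; set APPLY=1 to actually run it
# (requires a gh login with admin rights on the repository).
set_default_permission() {
  repo="$1"
  level="$2"
  case "$level" in
    read|write) ;;
    *) echo "level must be 'read' or 'write'" >&2; return 2 ;;
  esac
  endpoint="repos/${repo}/actions/permissions/workflow"
  if [ "${APPLY:-0}" = "1" ]; then
    gh api --method PUT "$endpoint" -f "default_workflow_permissions=${level}"
  else
    echo "dry-run: gh api --method PUT ${endpoint} -f default_workflow_permissions=${level}"
  fi
}

# Dry-run with a placeholder repository name.
set_default_permission "OWNER/REPO" write
```

If the org-level setting is what restricts the repository, this call may be rejected and the change has to be made at the org level first.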

Regression history: pipeline #98 / v0.5.99. Related prior PRs in the same release-stability sweep: #251 (show-binary-sizes shell bug), #254 (macOS rustup proxy post-cache-restore).

Release pipeline #98 (v0.5.99) burned ~32 min building binaries on all three platforms, then died in <1 s at the 'Create GitHub Release' step with the cryptic:

    HTTP 403 — Resource not accessible by integration

Root cause was a repo-level setting flip: 'Settings → Actions → General → Workflow permissions' had been switched from 'Read and write' to 'Read-only', which clamps every job's GITHUB_TOKEN to read scope at runtime regardless of what the workflow file declares.  v0.5.96 (the previous successful release ~16 h earlier) ran with the same workflow file and succeeded, confirming the file was never the problem.  The audit log is not retrievable (free-tier org), so the actor/timestamp of the flip is unrecoverable.

This commit keeps the next occurrence from silently burning ~30 min of build time:

  1. Pre-flight permissions probe in 'release-preparation'.  Creates a draft release with a throwaway tag (run_id+attempt) and immediately deletes it on success.  Exercises the EXACT REST path that softprops/action-gh-release uses 30+ min later, so a permissions clamp surfaces in ~1 s with a precise error message pointing at the toggle that needs flipping (repo level, with an org-level fallback note).  Draft releases do not create git tags, so failed cleanup leaks at worst an invisible draft row — never an orphan tag.

  2. Explicit per-job 'permissions:' block on 'create-github-release' pinning 'contents: write', 'id-token: write', 'attestations: write'.  Documents the scope needs at the point of use rather than relying on inheritance from the top-level block ~520 lines up.  Does NOT change runtime behaviour by itself — the repo-level clamp still wins — but pairs with the pre-flight to make the failure mode self-documenting from the YAML alone.

Why a synthetic-release probe rather than a direct '/repos/.../actions/permissions/workflow' API call: that endpoint requires the 'administration: read' scope, which neither this workflow nor the default GITHUB_TOKEN carries.  Widening to add it would expand the permission surface; the create-draft probe stays inside 'contents: write' which the job already declares.

Local validation: actionlint clean on both touched workflow files; lint-fast + lint-pre-push will gate the push.
@githubrobbi githubrobbi enabled auto-merge (squash) May 15, 2026 18:36
@githubrobbi githubrobbi merged commit 971bf7c into main May 15, 2026
18 checks passed
@githubrobbi githubrobbi deleted the fix/ci-release-preflight-token-permissions branch May 15, 2026 18:48