Skip to content

fix(ci): repair macOS rustup proxy in release workflows + workspace-wide actionlint cleanup#245

Merged
githubrobbi merged 1 commit into
mainfrom
fix/release-cache-warm-macos-flake-2026-05-15
May 15, 2026
Merged

fix(ci): repair macOS rustup proxy in release workflows + workspace-wide actionlint cleanup#245
githubrobbi merged 1 commit into
mainfrom
fix/release-cache-warm-macos-flake-2026-05-15

Conversation

@githubrobbi
Copy link
Copy Markdown
Collaborator

Summary

Fixes the ~50% failure rate observed on release-cache-warm.yml's macOS leg over the past week, plus a workspace-wide actionlint cleanup of every other workflow.

Root cause

On macos-latest runner images the ~/.cargo/bin/cargo proxy is occasionally a stale symlink pointing at rustup-init (the installer binary) instead of the proxy that forwards to the active toolchain's real cargo. rustup show succeeded because it invoked rustup directly; the very next step's cargo build hit the broken proxy and fell through to the installer's argument parser:

error: error: unexpected argument 'build' found
Usage: rustup-init[EXE] [OPTIONS]
Stack backtrace:
   ...
   2: rustup::cli::setup_mode::main
   3: rustup_init::run_rustup_inner

Linux + Windows runners were unaffected.

Changes

release-cache-warm.yml::warm (the failing job) — three fixes

Fix What
A Install-step appends `rustup default "$(rustup show active-toolchain | awk '{print $1}')"` after `rustup show`. Rewrites the proxy binaries in `~/.cargo/bin` so symlinks resolve to the active toolchain's real cargo. Fail-fast smoke check (`cargo --version` / `rustc --version`) surfaces residual breakage in ~1s instead of letting Swatinem/rust-cache + `cargo build` waste 30+ min on the same root cause.
B Single auto-retry on the cargo-build step (2 attempts, 10-s sleep between). Absorbs transient network / sccache flakes without manual re-runs.
C `continue-on-error: true` at the warm-job level. The workflow's header comment already documents that cache-warming is best-effort by design; this matches that intent and stops noisy ❌ in PR checks for a single-platform image flake. Real regressions still surface via run history + per-job summary.

release.yml::build-release-binaries — Fix A mirrored

Same toolchain-install repair applied to the production release pipeline, because tag-dispatched runs would hit the same broken proxy on a freshly-flaky runner image — and on the production critical path a 30+ min late failure delays the release by a full re-run cycle.

Fixes B and C are NOT mirrored — `release.yml` is production critical-path; it should fail loudly so the `notify-failure` issue-opener fires.

Workspace-wide actionlint cleanup (no new warnings remain)

While auditing the two release workflows, every other workflow in `.github/workflows/` was actionlint-scanned and the pre-existing shellcheck warnings (all info / style level — none of them latent bugs) were resolved:

  • `release.yml` — 5 clusters: SC2086 (unquoted `$GITHUB_OUTPUT` / `$GITHUB_STEP_SUMMARY` writes), SC2129 (group individual redirects with `{ ... } >> "$out"`), SC2010 (`ls | grep` replaced with explicit glob loop over shipping-set names), SC2035 (`sha256sum *` / `shasum -a 256 ` switched to `./` form to guard against future filenames starting with `-`).
  • `auto-rerun-transient.yml` — 3× SC2016 false positives where literal backticks inside single-quoted `printf` format strings looked like command-substitution markers. Rewritten as `echo` with backslash-escaped backticks. No per-line suppression directive needed.
  • `cargo-vet-refresh.yml` — 1× SC2129 (grouped redirect).
  • `dependabot-review.yml` — 3× SC2129 (grouped redirects in delta-summary table, newly-resolved-crates block, dropped-crates block).

Verification

  • `actionlint .github/workflows/*.yml` → exit 0, zero warnings.
  • Pre-push gates green locally (lint-fast: 3s; lint-pre-push: 103s, all rust/dep/infra checks green).
  • Local bash test runs of the rewritten `echo` / grouped-redirect blocks produce byte-identical output to the originals (verified for both auto-rerun-transient.yml's 3 echo statements and dependabot-review.yml's table-rendering block).

Discipline notes (per request)

  • No suppression hacks. The SC2016 false positives in auto-rerun-transient.yml were resolved by switching from `printf` with single-quoted format strings to `echo` with escaped backticks — a real idiomatic rewrite, not a `# shellcheck disable` directive.
  • Surgical, root-cause fixes only. The macOS proxy bug is fixed at the install step (where the broken symlink lives); the retry + continue-on-error are scoped insurance, not workarounds.
  • Behavior preserved. All output is byte-identical to before; no public surface changed.
  • No tests dodged. This PR is workflow-only; no Rust test changes.
  • Atomic single-purpose commit, signed.

…ide actionlint cleanup

Root cause: `macos-latest` runner images ship with an occasionally-stale `~/.cargo/bin/cargo` symlink pointing at `rustup-init` (the installer) instead of the proxy that forwards to the active toolchain's real cargo.  `rustup show` succeeded (it invokes rustup directly), but the next step's `cargo build` hit the broken proxy and fell through to the installer's argument parser with `error: unexpected argument 'build' found`.  ~50% of `release-cache-warm.yml`'s macOS jobs were failing this way over the past week.

Three changes to release-cache-warm.yml::warm (the failing job):

  * Install-step now appends 'rustup default "$(rustup show active-toolchain | awk '{print $1}')"' after 'rustup show'.  Rewrites the proxy binaries in ~/.cargo/bin so symlinks resolve to the active toolchain's real cargo.  Fail-fast smoke check (cargo --version / rustc --version) surfaces any residual breakage in ~1s instead of letting Swatinem/rust-cache + cargo build burn 30+ min on the same root cause.

  * Single auto-retry on the cargo build step (2 attempts, 10-s sleep).  Absorbs transient network/sccache flakes without manual re-runs.

  * 'continue-on-error: true' at the warm-job level.  The workflow's own header comment documents that cache-warming is best-effort by design; this matches that intent and stops noisy ❌ in PR checks for a single-platform image flake.  Real regressions still surface via run history + per-job summary.

Mirror of fix #1 applied to release.yml::build-release-binaries — that workflow's tag-dispatched runs would hit the same broken proxy on a freshly-flaky runner image, and on the production release pipeline a 30+ min late failure delays the release by a full re-run cycle.  Fixes #2 and #3 are NOT mirrored: release.yml is production critical-path and should fail loudly so notify-failure fires.

Adjacent actionlint cleanup in the same PR (workspace-wide, zero new warnings):

  * release.yml — 5 clusters: SC2086 (unquoted $GITHUB_OUTPUT / $GITHUB_STEP_SUMMARY writes), SC2129 (group individual redirects with { ... } >> "$out"), SC2010 ('ls | grep' replaced with explicit glob loop over shipping-set names), SC2035 ('sha256sum *' / 'shasum -a 256 *' switched to './*' form to guard against future filenames starting with '-').

  * auto-rerun-transient.yml — 3x SC2016 false positives where literal backticks inside single-quoted printf format strings looked like command-substitution markers.  Rewritten as 'echo' with backslash-escaped backticks (idiomatic for 'print this template string').  No per-line suppression directive needed.

  * cargo-vet-refresh.yml — 1x SC2129 (grouped redirect).

  * dependabot-review.yml — 3x SC2129 (grouped redirects in delta-summary table, newly-resolved-crates block, dropped-crates block).

Verification: 'actionlint .github/workflows/*.yml' is now clean across every workflow (zero warnings, zero errors).  All script behavior is byte-identical to before; bash test runs of the rewritten echo / grouped-redirect blocks produce the same output as the originals.
@githubrobbi githubrobbi enabled auto-merge (squash) May 15, 2026 11:34
@githubrobbi githubrobbi merged commit 4daade9 into main May 15, 2026
18 checks passed
@githubrobbi githubrobbi deleted the fix/release-cache-warm-macos-flake-2026-05-15 branch May 15, 2026 11:48
githubrobbi added a commit that referenced this pull request May 15, 2026
…lds (#254)

* chore: development v0.5.99 - comprehensive testing complete [auto-commit]

* fix(ci): repair rustup proxy AFTER cache restore on macOS release builds

Release pipeline #97 (v0.5.98) failed on the aarch64-apple-darwin build with:

    error: unexpected argument 'build' found

    Usage: rustup-init[EXE] [OPTIONS]

    Stack backtrace: rustup_init::run_rustup_inner ...

PR #245 already added a proxy-repair line ('rustup default ...') to the 'Install pinned nightly' step, and the in-step 'cargo --version' smoke check still passes — but the very next step, 'Swatinem/rust-cache', restores '~/.cargo/bin/' from a prior run's cache.  On macos-latest that cache holds a poisoned 'cargo' symlink pointing at 'rustup-init' (the installer) rather than the toolchain proxy.  The restore overwrites the freshly-repaired '~/.cargo/bin/cargo', and the subsequent 'cargo build' picks up the poisoned proxy and exits with the rustup-init backtrace before reaching any cargo subcommand.

Empirical proof from the failing job (76219153458):

    16:36:34Z  cargo --version  ->  cargo 1.97.0-nightly  (proxy healthy)

    16:36:35Z  Swatinem/rust-cache: restoring ~/.cargo/bin/ ...

    16:36:54Z  cargo build         ->  rustup-init backtrace  (proxy poisoned)

Fix: add a second proxy-repair step *after* the cache-restore step and *before* the build step.  Re-running 'rustup default "$(rustup show active-toolchain | awk '{print $1}')"' rewrites the proxy binaries in '~/.cargo/bin' so the build resolves to the active toolchain's cargo even when the restored cache was poisoned.  On the next successful run, 'Swatinem/rust-cache' saves a clean cache and the poison clears itself.

Both 'release.yml::build-release-binaries' and 'release-cache-warm.yml::warm' get the same step (with matching commentary) — without the fix in the warm workflow, every macOS warm run would re-save the poisoned cache and perpetuate the regression.  Step is gated on 'runner.os == macOS' so Linux and Windows legs (which do not exhibit this symptom) are unaffected.

Pre-cache repair line is intentionally kept so the in-step 'cargo --version' smoke check still fails fast on a broken runner-image proxy.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant