Skip to content

fix(extensions): re-attempt transient BundleBuildFailed instead of pinning it (#424)#1440

Merged
stack72 merged 2 commits into
mainfrom
fix/bundle-build-failed-not-pinned-424
May 23, 2026
Merged

fix(extensions): re-attempt transient BundleBuildFailed instead of pinning it (#424)#1440
stack72 merged 2 commits into
mainfrom
fix/bundle-build-failed-not-pinned-424

Conversation

@keeb
Copy link
Copy Markdown
Contributor

@keeb keeb commented May 23, 2026

Summary

swamp serve would permanently lose a custom (@user/*) model type after a single transient bundle failure. When the daemon first loaded an npm-dependent extension on a cold deno cache with no network (e.g. a fresh deploy, before any CLI run warmed .swamp/bundles/), deno bundle failed and the catalog recorded a sticky BundleBuildFailed row. Scheduled workflows then failed every run with Unknown model type: <type> — across daemon restarts and even after connectivity returned — while CLI swamp workflow run that had warmed a good bundle still worked.

Root cause: a BundleBuildFailed row carries an empty bundle_path, so in findStaleFiles it slipped past the missing-bundle staleness gate whenever its source fingerprint still matched, and was never re-bundled.

Fix: treat a fingerprint-matched BundleBuildFailed as stale so the next scan re-attempts the bundle and recovers once conditions change. This is not an aggressive retry loop — the next natural scan (restart / next load pass) fixes it. Deterministic ValidationFailed stays excluded (re-bundling it only thrashes). The cold-start ReconcileFromDisk path is intentionally unchanged: a failed bundle leaves the catalog populated, so daemon-restart recovery routes through this warm path. The fix is shared across all extension kinds (model/vault/driver/report).

Closes #424.

Test Plan

  • New unit test: findStaleFiles: matching fingerprint + state=BundleBuildFailed → stale (#424); the existing ValidationFailed → not stale regression guard still passes.
  • Updated the one vault-loader test that encoded the old pinning behavior to assert recover-on-rescan.
  • Full suite green (6182 passed, 0 failed); deno check / lint / fmt clean.
  • Live reproduction in an isolated systemd-run -p PrivateNetwork=yes namespace with a cold DENO_DIR: recreated the pinned state (BundleBuildFailedUnknown model type), then restarted serve online — the daemon recovered on its own (Indexed, bundle rebuilt, scheduled run resolves the type).

Out of scope

A separate datastore-base-path-changed divergence between serve startup and scheduled-execution contexts (surfaced during reproduction) causes an unnecessary catalog re-bundle on the first scheduled run of each startup. This fix makes that invalidation benign; the churn itself is a follow-up.

🤖 Generated with Claude Code

keeb and others added 2 commits May 23, 2026 14:48
…nning it (#424)

A `BundleBuildFailed` catalog row carries an empty `bundle_path`, so in
`findStaleFiles` it slipped past the missing-bundle staleness gate when its
source fingerprint still matched — and was never re-bundled. A `swamp serve`
daemon that first loaded an npm-dependent extension on a cold deno cache with
no network (e.g. a fresh deploy) pinned the type as `BundleBuildFailed` and
failed every scheduled run with `Unknown model type`, permanently — across
restarts and even after connectivity returned, while CLI runs that warmed a
good bundle still worked.

Treat a fingerprint-matched `BundleBuildFailed` as stale so the next scan
re-attempts the bundle and recovers once conditions change. Deterministic
`ValidationFailed` stays excluded — re-bundling it only thrashes. The
cold-start ReconcileFromDisk path is intentionally unchanged: a failed bundle
leaves the catalog populated, so daemon-restart recovery routes through this
warm path.

Verified against an isolated systemd-run reproduction: a daemon pinned
offline now recovers to `Indexed` on the next online restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The vault loader test asserted end-to-end recovery of a BundleBuildFailed row
through the loader's warm-start sourcePathIndex re-bundle path, using two
unrelated temp dirs. That path is sensitive to Windows path canonicalization
between unrelated directories and failed on windows-latest. The fix itself
(findStaleFiles treating a fingerprint-matched BundleBuildFailed as stale) is
covered directly and portably by the new bundle_freshness_test.ts unit test,
which passes on all platforms. Its prior assertion also encoded the old
pinning behavior, so it could not simply stay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Clean, well-scoped fix for a real bug. The root cause analysis is thorough — a BundleBuildFailed row with an empty bundle_path and unchanged fingerprint slipped through both the fingerprint-mismatch gate and the missing-bundle gate, permanently pinning the failure.

Blocking Issues

None.

Suggestions

  1. The inline comment block at bundle_freshness.ts:312-318 (7 lines) is on the verbose side for this codebase. The first sentence and the BundleBuildFailed-vs-ValidationFailed distinction are load-bearing; the cold-start/ReconcileFromDisk detail could live in the design doc alone (which it already does). Minor — not blocking.

What looks good

  • Fix placement: The new BundleBuildFailed check is correctly positioned after the fingerprint match and before the bundle_path/bundleExists check, which is exactly where the bug lived.
  • Distinction preserved: ValidationFailed (deterministic) stays excluded from retry; BundleBuildFailed (transient) is now retried. The existing ValidationFailed → not stale test guards this boundary.
  • Test: New unit test directly exercises the exact scenario (matching fingerprint + empty bundle_path + BundleBuildFailed state). The removed vault-loader integration test was OS-fragile and is now replaced by a portable, more focused test at the right abstraction layer.
  • Design doc updated: The warm-start semantics in design/extension.md are updated to reflect the new behavior.
  • DDD: Change is entirely within the domain layer (src/domain/extensions/), touching only the staleness determination logic where it belongs. No layer violations.
  • Minimal blast radius: 4 files changed, the fix itself is a single 4-line conditional.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

Critical / High

None.

Medium

None.

Low

  1. src/domain/extensions/bundle_freshness_test.ts:608 — no Windows EBUSY .catch() on cleanup.
    The deleted vault loader test used Deno.remove(...).catch(() => {}) on Windows because it held a SQLite handle. The new freshness test only uses FakeCatalog (in-memory) and text files, so EBUSY is not a realistic concern — and the pattern is consistent with every other test in this file. Noting for completeness only.

  2. src/domain/extensions/bundle_freshness.ts:320 — every warm-start scan re-attempts a persistently failing BundleBuildFailed entry.
    If the underlying condition never resolves (e.g. a permanently broken npm dependency), every scan will re-invoke the bundler and fail again. This is bounded by scan frequency (not a tight retry loop) and the PR description explicitly acknowledges this tradeoff, so it's an acceptable design choice. If scan frequency ever increases, this could become noisy — but that's a future concern, not a current bug.

Verdict

PASS. Clean, minimal, well-placed fix. The new BundleBuildFailed check is inserted at exactly the right point in the decision chain — after fingerprint matching (so changed sources still take the normal path) and before the bundle-exists check (so the empty bundle_path on a failed row can't accidentally satisfy the old gate). The ValidationFailed exclusion is preserved correctly. The deleted vault loader test was asserting the old (now-incorrect) pinning behavior; its coverage is properly subsumed by the new unit test at the freshness layer. Logic, ordering, and edge cases check out.

@stack72 stack72 merged commit b33590c into main May 23, 2026
11 checks passed
@stack72 stack72 deleted the fix/bundle-build-failed-not-pinned-424 branch May 23, 2026 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add AWS profile support to vault provider and auto-registration

2 participants