feat: per-model datastore locks for concurrent model operations #713
Conversation
Manual testing of compiled binary

Tested the compiled binary against a fresh filesystem datastore to verify per-model locks work end-to-end.

Setup

```shell
mkdir /tmp/swamp-lock-test
cd /tmp/swamp-lock-test
swamp repo init
```

Created two models:

```yaml
# models/command/shell/model-a.yaml
type: "command/shell"
typeVersion: 1
id: "a0000000-0000-4000-a000-000000000001"
name: model-a
version: 1
tags: {}
globalArguments: {}
methods:
  execute:
    arguments:
      run: "sleep 5"
```

```yaml
# models/command/shell/model-b.yaml
type: "command/shell"
typeVersion: 1
id: "b0000000-0000-4000-b000-000000000002"
name: model-b
version: 1
tags: {}
globalArguments: {}
methods:
  execute:
    arguments:
      run: "sleep 5"
```

Concurrent execution test

Launched both model method runs in the background simultaneously:

```shell
swamp model method run model-a execute &
swamp model method run model-b execute &
```

After 2 seconds (while both were still running), checked for lock files: two separate per-model lock files — one per model, no global lock.

Results

Both processes started at the same time and completed at the same time — no blocking.
Replace the single global datastore lock with per-model locks so that operations on different models can run concurrently. Previously, running `model method run` on one model blocked all other model operations, even on completely unrelated models. With per-model locks, only operations on the same model wait for each other.

Closes #706

Co-authored-by: Blake Irvin <blakeirvin@me.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
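As a rough sketch of the per-model locking idea, assuming a filesystem datastore: each model gets its own lock file, and multiple locks are always taken in one globally consistent (sorted) order so two processes can never hold them in opposite orders. The names `modelLockPath` and `lockAcquisitionOrder` are illustrative, not the PR's actual API.

```typescript
interface ModelRef {
  modelType: string;
  modelId: string;
}

// Each model gets its own lock file under the datastore root
// (data/{modelType}/{modelId}/.lock in the PR's layout).
function modelLockPath(dataRoot: string, ref: ModelRef): string {
  return `${dataRoot}/${ref.modelType}/${ref.modelId}/.lock`;
}

// Deduplicate and sort: acquiring locks in a single global order is the
// classic way to rule out deadlock between concurrent multi-lock holders.
function lockAcquisitionOrder(dataRoot: string, refs: ModelRef[]): string[] {
  const paths = refs.map((r) => modelLockPath(dataRoot, r));
  return [...new Set(paths)].sort();
}
```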
Wrap lock-acquiring code paths in try/finally blocks so per-model locks are always released even when method execution, evaluation, or workflow evaluation throws. Also add staleness detection to the global lock polling loop so per-model commands don't hang indefinitely when a structural command crashes without releasing its lock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
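A minimal sketch of the try/finally pattern this commit describes, with release errors swallowed so the original error is preserved. `withModelLock` and its parameter types are hypothetical stand-ins, not the PR's actual helpers.

```typescript
type ReleaseFn = () => Promise<void>;

async function withModelLock<T>(
  acquire: () => Promise<ReleaseFn>,
  body: () => Promise<T>,
): Promise<T> {
  const release = await acquire();
  try {
    return await body(); // may throw (method execution, evaluation, ...)
  } finally {
    try {
      await release(); // always attempted, even when body() threw
    } catch {
      // swallow release errors so the caller sees the original failure
    }
  }
}
```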
1. Global lock now waits for held per-model locks before acquiring, preventing races between structural commands and in-progress per-model operations.
2. S3 sync failures (pull and push) now throw instead of silently continuing with stale/lost data.
3. workflow_run.ts: `flushModelLocks()` in catch block wrapped in try/catch so the original error is preserved if lock release fails.
4. model_method_run.ts: `flushModelLocks()` in finally block wrapped in try/catch so the original error is preserved if lock release fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Move per-model lock release AFTER S3 push completes. Previously locks were released before push, so if push failed another process could acquire the lock, pull stale S3 data, and overwrite local changes.
2. Re-check global lock after acquiring each per-model lock. If a structural command acquired the global lock between the initial check and per-model acquisition, release all per-model locks, wait for the global lock, and retry from scratch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. `pullChanged()` and `pullChangedForModel()` now track failed downloads and throw with the list of failed files instead of returning a partial count. Prevents operating on incomplete/stale data.
2. `pushChanged()` now tracks failed uploads and throws instead of silently dropping files. Prevents data loss when S3 uploads fail.
3. Global lock path in `registerDatastoreSyncNamed()` now re-throws S3 pull errors instead of logging a warning and continuing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
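The "track failures, then throw with the full list" shape can be sketched as below, assuming a hypothetical per-file `downloadFile` callback; `pullChangedSketch` is illustrative, not the PR's real `pullChanged()`.

```typescript
async function pullChangedSketch(
  files: string[],
  downloadFile: (rel: string) => Promise<void>,
): Promise<number> {
  const failed: string[] = [];
  let pulled = 0;
  for (const rel of files) {
    try {
      await downloadFile(rel);
      pulled++;
    } catch {
      failed.push(rel); // keep going so the final error names every failed file
    }
  }
  if (failed.length > 0) {
    // Throwing (instead of returning a partial count) prevents callers from
    // proceeding on incomplete or stale data.
    throw new Error(`S3 pull failed for ${failed.length} file(s): ${failed.join(", ")}`);
  }
  return pulled;
}
```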
Wrap the flush function body in try/finally so per-model locks are released regardless of whether the S3 push succeeds. Previously, a push failure would throw before reaching lock release, leaving locks held until TTL expiry and blocking other operations on those models.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
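The flush ordering this commit fixes can be sketched as: push while the locks are still held, then release in `finally` so a push failure still frees the locks. `flushSketch` and its callbacks are hypothetical.

```typescript
async function flushSketch(
  pushChanged: () => Promise<void>, // S3 push, performed while locks are held
  releaseLocks: () => Promise<void>, // per-model lock release
): Promise<void> {
  try {
    await pushChanged(); // a failure here propagates to the caller...
  } finally {
    await releaseLocks(); // ...but the locks are released either way
  }
}
```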
Adversarial Review
I've traced through all code paths, edge cases, and failure scenarios in this per-model locking PR. The implementation is well-designed with appropriate safeguards.
Medium
- src/cli/repo_context.ts:468-474 — waitForPerModelLocks unbounded wait
  - The `while(true)` loop waits indefinitely for per-model locks to be released before acquiring the global lock.
  - Breaking scenario: If another process holds a per-model lock with an extremely large custom TTL (e.g., 24 hours configured via `LockOptions`), or if the staleness check has a clock skew issue (system clock jumps backward), this function blocks indefinitely.
  - Mitigation: The staleness check at line 448 (`Date.now() - acquiredAt <= info.ttlMs`) means stale locks are ignored. Default TTL is 30s. This is acceptable given typical usage.
  - Suggested improvement (optional): Add a configurable maximum wait time with fallback to warning + proceed.
- src/cli/repo_context.ts:592 — Recursive retry without depth limit
  - If `acquireModelLocks` detects a global lock during per-model acquisition, it releases everything and recursively calls itself.
  - Breaking scenario: Pathological case where a structural command repeatedly acquires/releases the global lock at exactly the wrong timing, causing unbounded recursion.
  - Mitigation: The wait loop at lines 577-589 blocks until the global lock is released or stale, making the pathological case extremely unlikely in practice.
  - Suggested improvement (optional): Add max retry count (e.g., 10) before throwing.
- src/cli/repo_context.ts:608-617 — Partial lock acquisition on S3 pull failure
  - If S3 `pullChangedForModel` fails mid-way through the lock acquisition loop, some per-model locks are held but the returned flush function is never constructed.
  - Breaking scenario: Process A acquires lock for model/X, S3 pull for model/Y fails, error propagates up. Locks for model/X remain in the global `entries` map until CLI exit or SIGINT.
  - Why it's OK: The locks ARE registered in the coordinator's `entries` map (line 558), so `flushDatastoreSync` at CLI exit releases them. The SIGINT handler also covers this. TTL expiration handles truly orphaned locks. This is acceptable latency, not a leak.
- src/cli/commands/datastore_lock.ts:128 — Per-model lock status only works for filesystem
  - `scanModelLocks` only scans filesystem datastores. S3 per-model locks are not shown in `lock status`.
  - Impact: Users with S3 datastores cannot inspect per-model lock state via the breakglass command.
  - Why acceptable: S3 locks can be inspected directly via AWS console/CLI. This is a feature gap, not a bug.
- src/cli/commands/workflow_run.ts:278-283 — No locks when model references don't resolve
  - If `modelRefs` is non-empty but `findDefinitionByIdOrName` returns null for all (models don't exist), `resolvedModels` is empty and no locks are acquired.
  - Why acceptable: The workflow will fail at runtime when the model_method task executes, surfacing a clear error. Acquiring no locks is the correct behavior for this error case.
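Two of the Medium items above lend themselves to short sketches: the TTL staleness test from the first mitigation, and the bounded-retry variant suggested for the second item. Field names follow the review (`acquiredAt`, `ttlMs`); the function names are illustrative, not the PR's API.

```typescript
interface LockInfo {
  acquiredAt: number; // epoch ms when the lock was taken
  ttlMs: number;      // lock time-to-live
}

// A lock is only honored while its age is within its TTL. Note that a
// negative age (system clock jumped backward) still counts as live, which
// is exactly the clock-skew hazard the review points out.
function isLive(info: LockInfo, now: number = Date.now()): boolean {
  return now - info.acquiredAt <= info.ttlMs;
}

// Bounded retry instead of unbounded recursion: tryAcquire is a hypothetical
// callback that returns false when a global lock interfered and a full retry
// (release everything, wait, reacquire) is needed.
async function acquireWithRetryLimit(
  tryAcquire: () => Promise<boolean>,
  maxAttempts = 10,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await tryAcquire()) return;
  }
  throw new Error(`gave up acquiring per-model locks after ${maxAttempts} attempts`);
}
```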
Low
- src/domain/workflows/model_reference_extractor.ts:83-93 — Missing nested workflow silently skipped
  - If a workflow references a nested workflow that doesn't exist, `nestedWorkflow` is null and its model references aren't extracted.
  - Impact: Potential under-locking if the nested workflow is created after extraction but before execution.
  - Why acceptable: The workflow will fail at runtime anyway since the nested workflow doesn't exist. This is a deferred error, not a correctness issue.
- Potential race between S3 index mutation and persistence
  - `pullChangedForModel` mutates `this.index.entries[rel].localMtime` in memory (line 271) but doesn't persist immediately.
  - Why acceptable: The index is a best-effort optimization. Worst case is redundant file transfers, not data corruption.
Verified Correctness

- Deadlock prevention: Lock ordering via sorted acquisition (lines 526-529) is correct.
- TOCTOU mitigation: Re-checking the global lock after each per-model acquisition (lines 564-593) minimizes but doesn't eliminate the race. The remaining window is microseconds. Acceptable.
- Path traversal protection: `assertSafePath` validates all paths derived from S3 index entries.
- SIGINT cleanup: Handler at lines 66-76 releases all held locks with a 5s timeout before force exit.
- Lock heartbeat management: Recursive retry properly releases locks via `flushDatastoreSyncNamed`, which stops heartbeats.
- S3 push atomicity: Global lock held during push (lines 628-649) prevents concurrent pushers from clobbering each other.
Verdict
PASS — The per-model locking implementation is solid. The medium-severity items are edge cases with appropriate mitigations (TTL expiration, CLI-level cleanup, SIGINT handler). No blocking issues found.
it's been passed but not updated
Summary
Replaces the single global datastore lock with per-model locks so that operations on different models can run concurrently. This fixes the core issue where running `model method run` on one model blocked all other model operations, even on completely unrelated models.

- Per-model lock files at `data/{modelType}/{modelId}/.lock` — operations on different models no longer block each other
- Sync coordinator refactored to a `Map<string, SyncEntry>` supporting multiple concurrent locks
- Model-scoped S3 sync (`pullChangedForModel`) so S3 datastores only pull/push the relevant model's files
- Breakglass commands (`lock status`, `lock release --model`) to show and manage per-model locks

Closes #706
Why per-model locks?
The global lock was designed for simplicity, but it creates an O(n) bottleneck for fleet operations. When managing 100+ VMs where each method call takes 30-90 seconds, a single global lock means a long-running `healthCheck` on `prod-vm-1` blocks a quick `checkService` on `dev-vm-2` — the second command waits or times out. With per-model locks, each operation takes only its own model's lock (`data/command/shell/prod-vm-1/.lock` and `data/command/shell/dev-vm-2/.lock`).

Structural commands (`data gc`, `repo init`, model create/delete) still use the global lock, and per-model lock acquisition checks for a held global lock first — so data integrity is preserved.

Impact on users

- `model method run` on different models runs concurrently (the primary win)
- `workflow run` acquires only the locks for models referenced in the workflow
- `model evaluate` and `workflow evaluate` use per-model locks for single-model/workflow evaluation
What changed

- `datastore_sync_coordinator.ts` — `Map<string, SyncEntry>` with named lock support
- `repo_context.ts` — `requireInitializedRepoUnlocked()`, `createModelLock()`, `acquireModelLocks()`
- `s3_cache_sync.ts` — `pullChangedForModel()` for model-scoped S3 sync
- `model_method_run.ts`
- `workflow_run.ts`
- `model_evaluate.ts` — `--all` keeps global
- `workflow_evaluate.ts` — `--all` keeps global
- `datastore_lock.ts` — `lock status` scans per-model locks; `lock release --model` added
- `datastore_output.ts`
- `model_reference_extractor.ts`
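A hypothetical shape for the `Map<string, SyncEntry>` refactor listed above — a coordinator that tracks multiple named lock entries and can flush one (a single model) or all (CLI exit / SIGINT). The class and method names here are sketches, not the file's actual exports.

```typescript
interface SyncEntry {
  lockPath: string;
  flush: () => Promise<void>; // push + release for this one entry
}

class SyncCoordinator {
  private entries = new Map<string, SyncEntry>();

  register(name: string, entry: SyncEntry): void {
    this.entries.set(name, entry);
  }

  // Flush a single named entry (e.g., one model's lock); a no-op if the
  // entry was already flushed, so double-release is harmless.
  async flushNamed(name: string): Promise<void> {
    const entry = this.entries.get(name);
    if (!entry) return;
    this.entries.delete(name);
    await entry.flush();
  }

  // Flush everything still held, for CLI exit or SIGINT cleanup.
  async flushAll(): Promise<void> {
    for (const name of [...this.entries.keys()]) {
      await this.flushNamed(name);
    }
  }
}
```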
Test plan

- `deno check`, `deno lint`, `deno fmt` clean
- Manual: `model method run` on different models both proceed without blocking, each with its own `.lock` file

🤖 Generated with Claude Code